#### Workshop / March 13, 2023 - March 14, 2023

## Workshop on Optimization and Machine Learning - Talks and Poster Sessions

Title of talk or poster session | Abstract |

Application of adversarial robustness | When we deploy models trained by standard training (ST), they work well on natural test data. However, those models cannot handle adversarial test data (also known as adversarial examples) that are algorithmically generated by adversarial attacks. An adversarial attack is an algorithm that applies specially designed tiny perturbations to natural data to transform them into adversarial data, in order to mislead a trained model into giving wrong predictions. Adversarial robustness aims to improve the robust accuracy of trained models against adversarial attacks, which can be achieved by adversarial training (AT). What is AT? Given the knowledge that the test data may be adversarial, AT carefully simulates some adversarial attacks during training. Thus, the model has already seen many adversarial training data in the past, and hopefully it can generalize to adversarial test data in the future. AT has two purposes: (1) correctly classify the data (same as ST) and (2) make the decision boundary thick so that no data lie near the decision boundary. In this talk, I will introduce how to leverage adversarial attacks/training for evaluating/enhancing the reliability of AI-powered tools. |
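
Illustration (not part of the abstract): a minimal sketch of how a tiny, specially designed perturbation can flip a prediction. It assumes a hand-picked logistic-regression model and uses a one-step sign-gradient attack (often called FGSM); the weights, input, and budget `eps` are hypothetical values chosen for the example, not anything from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sign_gradient_attack(x, y, w, b, eps):
    """One-step attack on a logistic-regression model.

    Moves x by eps (in the max-norm) in the direction that increases
    the cross-entropy loss for the true label y in {0, 1}.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w          # gradient of the loss with respect to x
    return x + eps * np.sign(grad_x)

# Hypothetical model and natural input with true label 1.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.3, 0.2])
y = 1

x_adv = sign_gradient_attack(x, y, w, b, eps=0.3)
clean_pred = int(sigmoid(w @ x + b) > 0.5)
adv_pred = int(sigmoid(w @ x_adv + b) > 0.5)
print(clean_pred, adv_pred)
```

Adversarial training would, roughly speaking, generate such perturbed inputs during training and fit the model on them as well.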

Importance-weighting approach to distribution shift adaptation | For reliable machine learning, overcoming the distribution shift is one of the most important challenges. In this talk, I will first give an overview of the classical importance weighting approach to distribution shift adaptation, which consists of an importance estimation step and an importance-weighted training step. Then, I will present a more recent approach that simultaneously estimates the importance weight and trains a predictor. Finally, I will discuss a more challenging scenario of continuous distribution shifts, where the data distributions change continuously over time. |
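
Illustration (not part of the abstract): a sketch of the importance-weighted training step only. It assumes the importance weight (test density divided by training density) is known in closed form, which replaces the estimation step described above; the two Gaussian distributions and the target function are hypothetical.

```python
import numpy as np

def weighted_linear_fit(x, y, w):
    """Weighted least squares for y ~ slope * x + intercept."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    Aw = A * w[:, None]
    return np.linalg.solve(A.T @ Aw, Aw.T @ y)

rng = np.random.default_rng(0)

# Training inputs follow N(0, 1); test inputs are assumed to follow N(1, 1).
x_train = rng.normal(0.0, 1.0, size=20000)
y_train = x_train ** 2            # the labeling rule is the same in both domains

# Density ratio N(1, 1) / N(0, 1) evaluated at the training points.
importance = np.exp(x_train - 0.5)

slope_plain, _ = weighted_linear_fit(x_train, y_train, np.ones_like(x_train))
slope_iw, _ = weighted_linear_fit(x_train, y_train, importance)
print(slope_plain, slope_iw)
```

The unweighted fit has a slope near 0, which is best for the training distribution, while the weighted fit moves towards the slope that is best under the shifted test distribution (2 in this toy setting).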

Control and Machine Learning | In this lecture we shall present some recent results on the interplay between control and Machine Learning, and more precisely, Supervised Learning and Universal Approximation. We adopt the perspective of the simultaneous or ensemble control of systems of Residual Neural Networks (ResNets). Roughly, each item to be classified corresponds to a different initial datum for the Cauchy problem of the ResNets, leading to an ensemble of solutions to be driven to the corresponding targets, associated with the labels, by means of the same control. We present a genuinely nonlinear and constructive method showing that such an ambitious goal can be achieved, and we estimate the complexity of the control strategies. This property is rarely fulfilled by the classical dynamical systems in Mechanics, and the very nonlinear nature of the activation function governing the ResNet dynamics plays a decisive role. It allows deforming half of the phase space while the other half remains invariant, a property that classical models in mechanics do not fulfill. The turnpike property is also analyzed in this context, showing that a suitable choice of the cost functional used to train the ResNet leads to more stable and robust dynamics. |

Address Practical Challenges in Artificial Intelligence from Medical Domain to General Tasks | Artificial intelligence has achieved much success thanks to abundant training data. However, most existing algorithms are designed for ideal scenarios, leaving a large gap before practical implementation. For example, the description is full of ambiguity and uncertainty, while machine learning algorithms require precise labels. In the general domain, real-life scenarios involve dark or over-exposed illumination, which is hard for existing computer vision techniques. In this talk, Dr Gu Lin will discuss these gaps and solutions to them. |

Sharpness-aware minimization as an optimal | Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. In this talk, I will show how SAM can be interpreted as optimizing a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. |

Statistical Inference for Neural Network-based Image Segmentation | Although a vast body of literature relates to image segmentation methods that use deep neural networks (DNNs), less attention has been paid to assessing the statistical reliability of segmentation results. In this study, we interpret the segmentation results as hypotheses driven by DNN (called DNN-driven hypotheses) and propose a method to quantify the reliability of these hypotheses within a statistical hypothesis testing framework. To this end, we introduce a conditional selective inference (SI) framework---a new statistical inference framework for data-driven hypotheses that has recently received considerable attention---to compute exact (non-asymptotic) valid p-values for the segmentation results. To use the conditional SI framework for DNN-based segmentation, we develop a new SI algorithm based on the homotopy method, which enables us to derive the exact (non-asymptotic) sampling distribution of DNN-driven hypothesis. We conduct several experiments to demonstrate the performance of the proposed method. |

Meta Learning and Modularity Towards Systematic Generalization | Is scaling up neural networks enough to tackle uncertainty and build AI systems that are flexible to out-of-distribution data? Current deep neural networks tend to overfit the training distribution without sufficient logical reasoning and generality abilities to tackle novel scenarios. The problem of uncertainty in the predictions of neural networks shows that the world is only partially predictable and a learned neural network cannot generalize to its ever-changing surrounding environments. In this regard, System 2 styles of conscious processing aim to learn stationary representations by decomposing high-level knowledge into cooperative and competing task-specific components in System 1. The few-shot learning in the human brain suggests that System 2 cognition reuses these learned neural components for systematic generalization. To build System 2 embedded learning machines, meta-learning allows us to learn stationary information across the environments experienced by selecting and combining different learning components. It is of great interest to understand how humans leverage these reusable modular representations to tackle novel situations and how we can build learning machines that are capable of systematic generalization. To this end, I will demonstrate several pieces of evidence for tackling the problems above with Meta Learning. |

Robust learning enhanced by low dimensional structures | |

Graph Convolutional Networks Provably Benefit from Structural Information: A Feature Learning Perspective | It has been observed empirically that graph neural networks (GNNs) can show better generalization performance compared to a multilayer perceptron (MLP) for structured data. Despite tremendously successful applications of GNNs, the theoretical understanding of graph neural networks is still in its infancy. To this end, this work explores under what circumstances graph neural networks are superior by studying the feature learning of graph neural networks during gradient descent. We provide a characterization of generalization analysis for over-parametrized GNNs and MLP that are trained with gradient descent on data containing signals and noises. We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer GNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, the obtained two-layer MLP can only achieve constant level test loss thus failing to generalize under the same signal-to-noise ratio. These together demonstrate a sharp gap between GNN and MLP in terms of feature learning. To our knowledge, this is the first work to give precise conditions expounding the superiority of graph neural networks trained by gradient descent. |

Optimal transport distances between Gaussian measures and Gaussian processes | Optimal transport (OT) has been attracting much research attention in various fields, in particular machine learning and statistics. It is well-known that the exact OT distances are generally computationally demanding and suffer from the curse of dimensionality. One approach to alleviate these problems is via regularization. In this talk, we present recent results on the entropic regularization of OT in the setting of Gaussian measures and their generalization to the infinite-dimensional setting of Gaussian processes. In these settings, the entropic regularized Wasserstein distances admit closed form expressions, which satisfy many favorable theoretical properties, especially in comparison with the exact distance. In particular, we show that the infinite-dimensional regularized distances can be consistently estimated from the finite-dimensional versions, with dimension-independent sample complexities. The mathematical formulation will be illustrated with numerical experiments on Gaussian processes. |
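
Illustration (not part of the abstract): a sketch of entropic regularization in the discrete case, using Sinkhorn iterations between two Gaussian-shaped histograms on a grid. This is the generic numerical scheme, not the closed-form expressions presented in the talk; the grid, the histograms, and the regularization strength `eps` are assumptions made for the example.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps, n_iter=200):
    """Entropic-regularized transport plan between histograms a and b."""
    K = np.exp(-C / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # rescale to match column marginals
        u = a / (K @ v)               # rescale to match row marginals
    return u[:, None] * K * v[None, :]

grid = np.linspace(0.0, 1.0, 20)

def discretized_gaussian(mean, std):
    weights = np.exp(-0.5 * ((grid - mean) / std) ** 2)
    return weights / weights.sum()

a = discretized_gaussian(0.3, 0.1)
b = discretized_gaussian(0.7, 0.1)
C = (grid[:, None] - grid[None, :]) ** 2   # squared-distance cost

P = sinkhorn_plan(a, b, C, eps=0.5)
transport_cost = float(np.sum(P * C))
print(transport_cost)
```

Each iteration only needs matrix-vector products with the kernel, which is what makes the regularized problem cheaper than solving the exact transport problem.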

AI for Social Good - Healthy Aging Support with EEG and fNIRS Neurobiomarker Employing Dynamic Network Analysis and Machine Learning Models | We will present a practical application of machine learning (ML) models within AI for a social good domain for elderly adult dementia onset forecasting. Contemporary neurotechnology applications such as brain-computer interfaces (BCI) and efficient machine learning (ML) algorithms contribute to the well-being improvement of individuals with limited mobility or communication. An extension of the above approaches to a field of neuro-biomarkers of age-related cognitive decline and early onset dementia opens new opportunities to monitor cognitive-behavioral interventions and digital non-pharmacological therapies (NPT). Novel experimental paradigms developed in our lab utilizing combined EEG, fNIRS, and eye-tracking modalities in passive BCI frameworks to evaluate working memory and visuospatial implicit learning allow for the prediction of a mild cognitive impairment (MCI) in the elderly. We will present our recent results from two elderly volunteer pilot study groups in Japan and Poland. A pilot study supports the new experimental paradigms to showcase the vital application of artificial intelligence (AI) for early-onset mild cognitive impairment (MCI) prediction in the elderly. |

Optimization theory of neural networks under the mean-field regime | As an example of the nonlinear Fokker-Planck equation, mean-field Langevin dynamics recently has attracted attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean-field regime. This work gives a concise convergence rate analysis of these dynamics. The key ingredient of our proof is a proximal Gibbs distribution associated with the dynamics, which allows us to develop a simple theory parallel to classical convex optimization. Furthermore, we discuss the optimization dynamics in the mean-field regime through the lens of the primal-dual formulation. |

Improving resolution in deep learning-based estimation of drone position and direction using 3D maps | We propose a method to improve the resolution of drone position and direction estimation based on deep learning using 3D topographic maps in non-GPS environments. The global positioning system (GPS) is typically used to estimate the position of drones flying outdoors. However, it becomes difficult to estimate the position if the signal from GPS satellites is blocked by tall mountains or buildings, or if there are interference signals. To cope with this loss of GPS, we previously developed a learning-based flight area estimation method using 3D topographic maps. With this method, the flight area could be estimated with an accuracy of 98.4% in experiments. However, a resolution of 40 meters square is difficult to use for drone control. Therefore, in this study, we verify whether it is possible to improve the resolution by multiplexing the area division and the data acquisition direction. We also investigate whether the flight direction of the drone can be detected using a 3D map. Experimental results show that the position estimation was 96.8% accurate at the resolution of 25 meters square, and the direction estimation was 92.6% accurate for 12-direction estimation. |

Efficient machine learning with tensor networks | Tensor Networks (TNs) are factorizations of high-dimensional tensors into networks of many low-dimensional tensors, which have been studied in quantum physics, high-performance computing, and applied mathematics. In recent years, TNs have been increasingly investigated and applied to machine learning and signal processing, due to their significant advances in handling large-scale and high-dimensional problems, model compression in deep neural networks, and efficient computations for learning algorithms. This talk aims to present some recent progress of TN technology applied to machine learning from the perspectives of basic principles and algorithms, novel approaches in unsupervised learning, tensor completion, multi-model learning and various applications in DNN, CNN, RNN, etc. |

Nystrom Method for Accurate and Scalable Implicit Differentiation | The essential difficulty of gradient-based bilevel optimization is to estimate the inverse Hessian vector product of neural networks. This paper proposes to tackle this problem by the Nystrom method and the Woodbury matrix identity, exploiting the low-rankness of the Hessian. Compared to existing methods using iterative approximation, such as conjugate gradient and the Neumann series approximation, the proposed method avoids numerical instability and can be efficiently computed in matrix operations without iterations. As a result, the proposed method works stably in various tasks and is two times faster than iterative approximations. Throughout experiments, including large-scale hyperparameter optimization and meta learning, we demonstrate that the Nystrom method consistently achieves comparable or even superior performance to other approaches. |
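
Illustration (not part of the abstract): a sketch of how a Nystrom low-rank approximation and the Woodbury identity combine to give a regularized inverse-Hessian vector product without iterations. It assumes an explicit symmetric positive semidefinite matrix `H`, a damping parameter `rho`, and a chosen set of column indices; these names and the toy matrix are hypothetical and the authors' implementation may differ.

```python
import numpy as np

def nystrom_ihvp(H, v, idx, rho):
    """Approximate (H + rho * I)^{-1} v.

    H is replaced by its Nystrom approximation C W^{-1} C^T built from the
    columns in idx, and the inverse is applied via the Woodbury identity,
    so only a small k-by-k system is solved.
    """
    C = H[:, idx]                     # n x k block of sampled columns
    W = H[np.ix_(idx, idx)]           # k x k intersection block
    small = W + C.T @ C / rho
    return v / rho - C @ np.linalg.solve(small, C.T @ v) / rho ** 2

rng = np.random.default_rng(0)
n, r = 50, 5
B = rng.normal(size=(n, r))
H = B @ B.T                           # toy low-rank "Hessian"
v = rng.normal(size=n)
rho = 0.1

x_nys = nystrom_ihvp(H, v, np.arange(r), rho)
x_ref = np.linalg.solve(H + rho * np.eye(n), v)   # dense reference solve
print(np.max(np.abs(x_nys - x_ref)))
```

In this toy case the matrix has rank exactly equal to the number of sampled columns, so the Nystrom approximation is exact and the result matches the dense solve up to rounding; for a real Hessian the result is an approximation.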

Fredholm integral equations for the training of shallow neural networks | We present a novel approach for the training of single-hidden-layer neural networks, based on the approximate solution of associated Fredholm integral equations of the first kind by Ritz-Galerkin methods. We show how the functional tensor-train format and Tikhonov regularization can be used to construct continuous counterparts of discrete neural networks with an infinitely large hidden layer. The efficiency and reliability of the introduced approach are illustrated by the practical application to several supervised-learning problems. |

Fast Robust Classifiers for Data Streams | In this paper, we consider classification problems with streaming data that can be modeled by a time series for each class. Current methods for data streams require substantial re-computation after new observations. We develop a method that requires minimal effort to capture new information. For this, we extend the concept of the Minimax Probability Machine (MPM) towards classifying data streams, and develop two algorithms: (i) the Adaptable Robust Classifier (AdRC), which efficiently re-solves the MPM problem at every time step using updated moments, and (ii) the Adjustable Robust Classifier (AjRC), which adversarially learns the time series models and provides decision rules to adjust the classifier to new observations. Both methods are robust against the uncertainty inherent in time series. The performance of both of these methods is probed with numerical experiments. |

Computer-Assisted Proofs in Extremal Combinatorics | We study how AI and Optimization can be used to obtain computer-assisted proofs in Extremal Combinatorics. In particular, we will explore the SDP approach based on Flag Algebras as well as bounds obtained through Combinatorial Optimization problems derived from blowup constructions. As an application, we will derive some improved as well as tight bounds on some longstanding open problems going back to Erdős. |

On the structure selection for tensor network decomposition | |

A random subspace Newton method for non-convex optimization | |

Multi-Objective Optimization of Performance and Interpretability of Tabular Supervised Machine Learning Models | We present a model-agnostic framework for jointly optimizing the predictive performance and interpretability of supervised machine learning models for tabular data. Interpretability is quantified via three measures: feature sparsity, interaction sparsity of features, and sparsity of non-monotone feature effects. By treating hyperparameter optimization of a machine learning algorithm as a multi-objective optimization problem, our framework allows for generating diverse models that trade off high performance and ease of interpretability in a single optimization run. Efficient optimization is achieved via augmentation of the search space of the learning algorithm by incorporating feature selection, interaction and monotonicity constraints into the hyperparameter search space. We demonstrate that the optimization problem effectively translates to finding the Pareto optimal set of groups of selected features that are allowed to interact in a model, along with finding their optimal monotonicity constraints and optimal hyperparameters of the learning algorithm itself. We then introduce a novel evolutionary algorithm that can operate efficiently on this augmented search space. In benchmark experiments, we show that our framework is capable of finding diverse models that are highly competitive or outperform state-of-the-art XGBoost or Explainable Boosting Machine models, both with respect to performance and interpretability. |
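
Illustration (not part of the abstract): a sketch of what "Pareto optimal set" means for a handful of candidate models. It assumes each model is scored by two objectives to be minimized, prediction error and number of used features; the candidate values are made up and the evolutionary algorithm from the abstract is not reproduced.

```python
import numpy as np

def pareto_front(points):
    """Return indices of non-dominated rows (all objectives are minimized)."""
    pts = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(pts):
        no_worse = np.all(pts <= p, axis=1)       # at least as good everywhere
        strictly_better = np.any(pts < p, axis=1)  # and better somewhere
        if not np.any(no_worse & strictly_better):
            front.append(i)
    return front

# Hypothetical candidates: (test error, number of features used).
candidates = [(0.10, 8), (0.12, 5), (0.20, 2), (0.15, 6), (0.10, 9)]
front = pareto_front(candidates)
print(front)
```

The candidates that remain are those for which no other model is at least as accurate and at least as sparse, with one of the two strictly better.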

Efficient standardization of clustering comparison metrics | The most popular metrics for clustering comparison, the adjusted Rand index and the adjusted mutual information, are biased. Standardized variants of these metrics mitigate that bias but lack adoption due to high computational effort. We reduce the computational complexity for the standardized Rand index from O(N^3 R max(R, C)) to O(R C) by careful manipulation of the variance term. We introduce the pairwise standardized mutual information with complexity O(m^2), where m is the number of non-zero elements in the contingency matrix. We show on synthetic and real datasets that it has similar properties to the fully standardized mutual information, which has complexity O(N^3 R max(R, C)). |
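
Illustration (not part of the abstract): a sketch of the adjusted Rand index computed from the R-by-C contingency matrix, which is the baseline metric mentioned above. The standardized variants proposed by the authors are not reproduced here; the function names are hypothetical and at least two samples are assumed.

```python
import numpy as np

def contingency_table(labels_a, labels_b):
    """Count how many samples fall in each pair of clusters."""
    _, ia = np.unique(labels_a, return_inverse=True)
    _, ib = np.unique(labels_b, return_inverse=True)
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=np.int64)
    np.add.at(table, (ia, ib), 1)
    return table

def pairs(counts):
    """Number of unordered pairs, n * (n - 1) / 2, elementwise."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts * (counts - 1.0) / 2.0

def adjusted_rand_index(labels_a, labels_b):
    table = contingency_table(labels_a, labels_b)
    index = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:           # degenerate case, e.g. a single cluster
        return 1.0
    return float((index - expected) / (maximum - expected))

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_same, ari_cross)
```

Once the contingency matrix is available, the remaining work is a constant number of passes over its R times C entries.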

Optimization on Riemannian Manifolds: How and why? | Riemannian optimization refers to solving optimization problems defined on Riemannian manifolds. Such methods have attracted increasing interest from the machine learning community. In this talk, I will introduce the basic concepts and challenges of Riemannian optimization and motivate its use for machine learning problems. |
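
Illustration (not part of the abstract): a sketch of Riemannian gradient descent on the unit sphere, one of the simplest manifolds. It minimizes the negative Rayleigh quotient, so the iterate approaches a leading eigenvector; the matrix, start point, and step size are assumptions chosen for the example.

```python
import numpy as np

def riemannian_gd_sphere(A, x0, step, n_iter):
    """Minimize f(x) = -x^T A x subject to ||x|| = 1."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        egrad = -2.0 * A @ x                  # Euclidean gradient
        rgrad = egrad - (x @ egrad) * x       # project onto the tangent space at x
        x = x - step * rgrad                  # move along the tangent direction
        x = x / np.linalg.norm(x)             # retraction: map back to the sphere
    return x

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
x_star = riemannian_gd_sphere(A, np.array([1.0, 0.0, 0.0]), step=0.1, n_iter=200)
rayleigh = float(x_star @ A @ x_star)
print(x_star, rayleigh)
```

The two manifold-specific ingredients are the tangent-space projection of the gradient and the retraction; everything else is ordinary gradient descent.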

IALE: Imitating Active Learner Ensembles | In this talk we introduce IALE (short for “Imitating Active Learner Ensembles”). Active learning prioritizes the labeling of the most informative data samples. However, the performance of active learning heuristics depends on both the structure of the underlying model architecture and the data. IALE is an imitation learning scheme that imitates the selection of the best-performing expert heuristic at each stage of the learning cycle in a batch-mode pool-based setting. We use Dagger to train a transferable policy on a dataset and later apply it to different datasets and deep classifier architectures. The policy reflects on the best choices from multiple expert heuristics given the current state of the active learning process and learns to select samples in a complementary way that unifies the expert strategies. Our experiments on well-known image datasets show that we outperform state of the art imitation learners and heuristics. |

Safe Monte Carlo Tree Search Using Learned Safety Critics | Most sequential decision-making tasks in the real world cannot be fully described in a single-objective setting using the Markov Decision Process framework. The Constrained Markov Decision Process framework is an alternative that allows incorporating additional cost functions and cost constraints, apart from the primary objective function. Online planning-based approaches to solving such problems have been little explored. The current state-of-the-art online planning algorithm called Cost-Constrained Monte Carlo Planning (CC-MCP) uses Monte Carlo cost estimates to avoid constraint violations. This is a high-variance estimate and the performance is conservative w.r.t. costs. Instead, we learn cost estimates using Temporal Difference learning, a lower-variance estimate, in an offline phase prior to agent deployment. The estimator is called the safety critic and is used during deployment within MCTS to limit the exploration of the search tree and remove unsafe trajectories. We call this approach Safe MCTS. Safe MCTS acts closer to the cost constraint as compared to CC-MCP, and achieves higher rewards with safety. Also, the planner is more efficient, requiring fewer planning steps. We also show that under model mismatch between the planner and the real world, our approach is less susceptible to cost violations as compared to CC-MCP. |

Acceleration of Frank-Wolfe Algorithms with Open-Loop Step-Sizes | Frank-Wolfe algorithms (FW) are popular first-order methods for solving constrained convex optimization problems that rely on a linear minimization oracle instead of potentially expensive projection-like oracles. Many works have identified accelerated convergence rates under various structural assumptions on the optimization problem and for specific FW variants when using line-search or short-step, requiring feedback from the objective function. Little is known about accelerated convergence regimes when utilizing open-loop step-size rules, a.k.a. FW with pre-determined step-sizes, which are algorithmically extremely simple and stable. Not only is FW with open-loop step-size rules not always subject to the same convergence rate lower bounds as FW with line-search or short-step, but in some specific cases, such as kernel herding in infinite dimensions, it has been empirically observed that FW with open-loop step-size rules enjoys faster convergence rates than FW with line-search or short-step. We propose a partial answer to this unexplained phenomenon in kernel herding, characterize a general setting for which FW with open-loop step-size rules converges non-asymptotically faster than with line-search or short-step, and derive several accelerated convergence results for FW with open-loop step-size rules. Finally, we demonstrate that FW with open-loop step-sizes can compete with momentum-based open-loop FW variants. |
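
Illustration (not part of the abstract): a sketch of Frank-Wolfe with the common open-loop step size 2 / (t + 2), where the step depends only on the iteration counter. It assumes a quadratic objective over the probability simplex, where the linear minimization oracle simply picks a vertex; the target vector is a hypothetical example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iter):
    """Frank-Wolfe over the probability simplex with open-loop steps."""
    x = x0.copy()
    for t in range(n_iter):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # linear minimization oracle: best vertex
        eta = 2.0 / (t + 2.0)          # pre-determined step, no objective feedback
        x = (1.0 - eta) * x + eta * s  # convex combination stays feasible
    return x

target = np.array([0.2, 0.3, 0.5])     # lies inside the simplex

def objective(x):
    return 0.5 * float(np.sum((x - target) ** 2))

def gradient(x):
    return x - target

x_fw = frank_wolfe_simplex(gradient, np.array([1.0, 0.0, 0.0]), n_iter=2000)
f_final = objective(x_fw)
print(x_fw, f_final)
```

No projection and no line search is needed: each iterate is a convex combination of simplex vertices, so feasibility is maintained automatically.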

Domain Adaptation for Time-Series Classification: A Method and Benchmark | The performance of a machine learning model degrades when it is applied to data from a similar but different domain than the data it has initially been trained on. To mitigate this domain shift problem, domain adaptation (DA) techniques search for an optimal transformation that converts the (current) input data from a source domain to a target domain to learn a domain-invariant representation that reduces domain discrepancy. We propose a novel supervised DA based on two steps. First, we search for an optimal class-dependent transformation from the source to the target domain from a few samples. We consider optimal transport methods such as the earth mover's distance, Sinkhorn transport and correlation alignment. Second, we use embedding similarity techniques to select the corresponding transformation at inference. We use correlation metrics and higher-order moment matching techniques. We conduct an extensive evaluation on time-series datasets with domain shift. |

Multi-Objective Hyperparameter Optimization -- A Review of the State-of-the-Art and Open Challenges | Hyperparameter optimization constitutes a large part of typical modern machine learning workflows. This arises from the fact that machine learning methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration. Nowadays, models and pipelines are held to a high standard, as the ML process often comes with a number of different stakeholders. While predictive performance measures are still decisive in most cases, models must be reliable, robust, accountable for their decisions, efficient for seamless deployment, and so on. Integrating additional metrics like energy efficiency, robustness or fairness ultimately results in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. We present a survey of the current state of the art in multi-objective hyperparameter optimization including methods, applications and challenges, with a special emphasis on open topics and directions of current and future research. |

Meta-Learning Multi-armed bandits for Beam Tracking in 5G and 6G networks | Beamforming-capable antenna arrays with many elements enable higher data rates in next generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams with each of them radiating towards a specific direction, and a beam management function continuously selects optimal beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams in the environment make an optimal beam selection challenging. Previous work uses supervised learning and trains classifiers to predict the next best beam based on previously selected beams. In contrast to them we formulate the problem as a Partially Observable Markov Decision Process (POMDP) and model the environment as the codebook itself. At each timestep, we select a candidate beam conditioned on the unobservable optimal beam ID. This frames the beam selection problem as an online search procedure that locates the moving optimal beam ID. A key advantage of our method is its increased flexibility to handle new or unforeseen trajectories and changes in the physical environment while previous work usually overfits to trajectories seen in the dataset. Initial results with complex environmental geometry and different movement patterns for the UEs support the applicability of our approach. |

Decomposition methods for mixed-integer optimal control | |

Orthogonality-based Cut Filtering | Cutting planes are an important element in reducing the solving time of mixed-integer programs (MIPs). Because of the many available generators for these cutting planes, we need good selection policies to avoid increasingly bloating the original problem throughout the solving process while minimizing the impact of discarding cuts. In this talk I present new cut selection methods that employ a dynamic orthogonality criterion to greedily select cuts with a guaranteed minimum increase of the efficacy for any pair of cutting planes in the set. We present detailed computational results for an implementation in SCIP 8 on the MIPLIB2017 benchmark testset. |
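
Illustration (not part of the abstract): a sketch of a plain greedy cut filter with a fixed parallelism threshold, the kind of baseline that orthogonality-based selection builds on. The dynamic criterion presented in the talk is not reproduced; cuts are assumed to be given as rows of `A x <= b`, and the function name, threshold, and data are hypothetical. This does not use the SCIP API.

```python
import numpy as np

def select_cuts(A, b, x_lp, max_parallelism=0.9):
    """Greedily pick violated cuts a_i^T x <= b_i that are nearly orthogonal.

    Cuts are visited in order of decreasing efficacy (distance by which the
    cut separates the LP solution); a cut is skipped if it is too parallel
    to one that has already been selected.
    """
    norms = np.linalg.norm(A, axis=1)
    efficacy = (A @ x_lp - b) / norms
    unit = A / norms[:, None]
    selected = []
    for i in np.argsort(-efficacy, kind="stable"):
        if efficacy[i] <= 0:              # remaining cuts are not violated
            break
        if all(abs(unit[i] @ unit[j]) <= max_parallelism for j in selected):
            selected.append(int(i))
    return selected

# Three hypothetical cuts; the first two are almost parallel.
A = np.array([[1.0, 0.0],
              [1.0, 0.01],
              [0.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
x_lp = np.array([2.0, 2.0])               # current LP solution to cut off

selected = select_cuts(A, b, x_lp)
print(selected)
```

Of the two nearly parallel cuts only the more efficacious one is kept, which limits how much the problem grows per separation round.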

Optimizing Neural Networks with Multi-Objective Bayesian Optimization and Augmented Random Search | Deploying Deep Neural Networks (DNNs) on microcontrollers is a common trend to process the increasing amount of data generated by sensors on embedded devices. Multi-objective optimization approaches can be used to compress DNNs by applying network pruning and weight quantization to minimize the memory footprint (RAM), the number of parameters (ROM) and the number of floating point operations (FLOPs) while maintaining the predictive accuracy. In this paper, we show that existing multi-objective Bayesian optimization (MOBOpt) approaches can fall short in finding optimal candidates on the Pareto front and propose a novel solver based on an ensemble of competing parametric policies trained using an Augmented Random Search reinforcement learning agent. Our methodology aims to find an optimal tradeoff between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity. Our experiments show that we outperform existing MOBOpt approaches consistently on different data sets and architectures like ResNet18 and MobileNetv3. |

Self-supervised, feasible optimization proxies for economic dispatch problems | |

Robust Few-Shot Learning for Histopathology | One of the biggest challenges in computational pathology is creating large labelled datasets to train specific AI models. Therefore, deep neural networks that can be adapted to new tasks using only a few labelled examples are desirable. Furthermore, robustness to data heterogeneity caused by differences in the data generation process (e.g. different staining protocols or microscopic scanners) is essential. We present a multi-prototype few-shot model (ProtoNet) for tissue classification. While only data from one clinic is used to train the model, an evaluation on a multi-scanner and multi-center database demonstrates its robustness. This robustness is achieved by employing domain-specific data augmentation during training. Furthermore, the influence of prototype selection on the model performance is investigated. To show how the ProtoNet is able to generalize, we successfully adapt the model originally trained to recognize tumor in colon tissue sections to the new task of discriminating between urothelial carcinoma, healthy and necrotic tissue based on only a few labeled examples. Finally, we have implemented an interactive workflow within our MICAIA® software in which a user can iteratively adapt a model, test it and systematically add labels for regions that were initially misclassified. |

Improving the Performance of Quantum Reinforcement Learning with Classical Post-Processing | Quantum reinforcement learning in the noisy intermediate-scale quantum computing era is often implemented by a variational quantum circuit (VQC) as the function approximator. For approximation in policy space, this requires decoding classical information from the prepared quantum state in order to select an action for the reinforcement learning agent. Following such a quantum policy gradient approach, we propose a specific action decoding procedure. It is constructed such that it maximizes the amount of information that the agent can extract from measurements in the computational basis. To achieve this, we optimize with respect to a novel quality measure for classical post-processing functions, which is motivated by local and global quantum measurements. Experimental results demonstrate that the reinforcement learning performance correlates with the introduced measure. This is supported by an analysis of quantities related to the expressibility and trainability of the underlying model. The developed algorithm is furthermore used to successfully perform training on a 5-qubit quantum hardware device. As our method introduces only negligible classical overhead, it suggests itself as a technique to boost the performance of VQC-based reinforcement learning. The underlying concept also has the potential to be transferred to the broader field of quantum machine learning. |

Machine Learning for Power Systems | |

Using Domain Adaptive Neural Networks for Detecting Defects in X-Ray Images | Inspecting cast components for defects is commonly done using X-Ray projections and classical image processing algorithms. The drawback is that configuring an inspection system for a new component requires a high level of expertise and knowledge. Therefore, we investigate an approach where we train an object detection neural network to detect defects in X-Ray images, which requires less expert knowledge for training the model. However, neural networks require a lot of data, and the acquisition of the X-Ray images and the bounding box labels is costly. To reduce the cost of developing a dataset, we have built a custom simulation pipeline that produces simulated X-Ray images into which extracted real-world defects are inserted. We show that we can train neural networks on simulation datasets combined with a smaller amount of real-world data. We furthermore experiment with treating the simulation and real-world data as separate domains so that we can use unsupervised and semi-supervised domain adaptation in order to reduce the number of real-world bounding box labels needed. |

Combining Active and Semi Supervised Learning to Address Challenges in Real-World Data | Do we need Active Learning? Over the past years, the rise of strong self-supervised and semi-supervised methods has raised doubts about the usefulness of active learning in settings with limited labels. Some studies show that combining self- and semi-supervised methods with random selection of data for labelling outperforms some active learning techniques. We address these points of criticism: while these results are valid in some scenarios, they are often based on well-established benchmark datasets, which can lead to overestimating the external validity of the performance improvements. We claim that there is a gap in the literature on exploring active learning in more realistic scenarios and on how active learning in combination with semi-supervised methods performs in these scenarios. Specifically, we emphasize the need for designing semi-supervised methods that can benefit from active learning by categorizing different active learning strategies and assessing their strengths and weaknesses. We furthermore pose a set of realistic data scenarios where the assumptions of semi-supervised methods are broken, and we show experiments where this leads to the semi-supervised methods underperforming. In those cases, we discuss how different types of active learning can be combined with the semi-supervised method to ensure the best of both worlds. |

Reinforcement Learning for Quantum Circuit Compilation | Near-term quantum applications heavily depend on the availability of quantum compilers that efficiently translate a high-level quantum algorithm into hardware-level operations. Whereas many traditional methods are characterized by high execution and pre-compilation times that limit their applicability for online compilation, recent contributions have shown Reinforcement Learning (RL) to be a promising candidate for mitigating these issues by modeling compilation as a sequential decision-making problem, where an agent successively adds gates to construct the target circuit. However, the problem complexity seems to scale exponentially with the number of qubits, and the multi-qubit setting remains unsolved to this day. Our method aims to leverage a structured Monte Carlo tree search (MCTS) of the solution space, informed by a Transformer, to generalize the method to more difficult compilation tasks. The first results show Decision Transformers to be a viable candidate architecture for the task. In our talk, we would like to present the preliminary results of our research and stimulate an open discussion with the research community. |

Computer-Assisted Proofs in Extremal Combinatorics | We study how AI and Optimization can be used to obtain computer-assisted proofs in Extremal Combinatorics. In particular, we will explore the SDP approach based on Flag Algebras as well as bounds obtained through Combinatorial Optimization problems derived from blowup constructions. As an application, we will derive some improved as well as tight bounds on some longstanding open problems going back to Erdős. |

Control and Machine Learning | In this lecture we shall present some recent results on the interplay between control and Machine Learning, and more precisely, Supervised Learning and Universal Approximation.
We adopt the perspective of the simultaneous or ensemble control of systems of Residual Neural Networks (ResNets). Roughly, each item to be classified corresponds to a different initial datum for the Cauchy problem of the ResNets, leading to an ensemble of solutions to be driven to the corresponding targets, associated with the labels, by means of the same control.
We present a genuinely nonlinear and constructive method that shows that such an ambitious goal can be achieved and allows the complexity of the control strategies to be estimated.
This property is rarely fulfilled by the classical dynamical systems in mechanics, and the strongly nonlinear nature of the activation function governing the ResNet dynamics plays a decisive role: it allows deforming half of the phase space while the other half remains invariant.
The turnpike property is also analyzed in this context, showing that a suitable choice of the cost functional used to train the ResNet leads to more stable and robust dynamics.
This lecture is inspired by joint work, among others, with Borjan Geshkovski (MIT), Carlos Esteve (Cambridge), Domènec Ruiz-Balet (IC, London) and Dario Pighin (Sherpa.ai). |
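
The ensemble control viewpoint described in the abstract above can be written down as follows. This is a sketch in assumed notation: the symbols $x_i$, $y_i$, $W$, $A$, $b$, $\sigma$, $h$ and $T$ are not taken from the abstract, and the specific form of the dynamics is one common way of modeling ResNets rather than necessarily the one used in the lecture.

```latex
% Discrete ResNet with step size h, applied to every data point x_i, i = 1, ..., N:
x_i^{k+1} = x_i^{k} + h \, W^{k} \sigma\!\left(A^{k} x_i^{k} + b^{k}\right),
\qquad x_i^{0} = x_i .

% Continuous-time limit: one Cauchy problem per data point, all sharing the same controls:
\dot{x}_i(t) = W(t) \, \sigma\!\left(A(t) x_i(t) + b(t)\right), \quad t \in (0, T),
\qquad x_i(0) = x_i .

% Simultaneous control goal: a single control (W, A, b) steers every trajectory
% to the target y_i associated with its label:
x_i(T) \approx y_i \qquad \text{for all } i = 1, \dots, N .
```

If $\sigma$ is taken to be the ReLU activation, $\sigma(s) = \max(s, 0)$ applied componentwise, the right-hand side vanishes wherever $A(t) x + b(t) \le 0$ in every component. With piecewise-constant controls, points in that region stay fixed while the rest of the space is deformed, which is one way to read the remark that half of the phase space remains invariant.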