Data-centric AI

Learning with few data or few annotated data


Data-centric AI (DCAI) offers a novel, complementary perspective on AI modeling that shifts the focus from model building to curating high-quality, consistently annotated training datasets. The underlying observation is that in many AI projects, the leverage for improving model performance lies in the curation of the training data used.

DCAI encompasses a wide range of methods such as model-based detection of annotation errors in the training data, creation of consistent multi-rater annotation systems that enable data annotation with minimal effort and maximum quality, use of weak and semi-supervised learning methods to exploit unannotated data, and human-in-the-loop approaches to iteratively improve models.


The groundbreaking successes of artificial intelligence (AI) in tasks such as speech recognition, object recognition, and machine translation are due in part to the availability of enormously large annotated data sets. Annotated data, also called labeled data, contains the label information that makes up the meaning of individual data points and is essential for training machine learning models. In many real-world scenarios, especially in industrial environments, large amounts of data are often available, but they are not annotated or only poorly annotated. This lack of annotated training data is one of the major obstacles to the broad application of AI methods in the industrial environment. Therefore, in the competence pillar »Few Labels Learning«, learning with few annotated data is explored within three focus areas and different domains: meta-learning strategies, semi-supervised learning, and data synthesis.

Meta-learning strategies for pathology and autonomous systems

Within the focus area »meta-learning strategies«, methods such as few-shot learning and transfer learning are developed and researched, among other areas in the field of medical imaging.

Especially for tissue classification in medicine, annotating large data sets is particularly difficult, and the data situation often varies from hospital to hospital or from imaging device to imaging device. Few-shot learning methods such as prototypical networks are therefore well suited to generalize between different applications. Transfer learning methods, in which models are pre-trained on comparable data sets with many annotated data points and then applied to the actual problem, are also used in this context for the practical evaluation of medical CT scans.
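To make the few-shot idea concrete, here is a minimal NumPy sketch of the prototypical-network classification rule (the embedding inputs and all names are illustrative assumptions, not the project's actual code): each class prototype is the mean embedding of its few labeled support examples, and a query is assigned to the nearest prototype.

    import numpy as np

    def prototype_classify(support_emb, support_labels, query_emb):
        """Nearest-prototype classification as in prototypical networks.

        support_emb:    (n_support, d) embeddings of the few labeled examples
        support_labels: (n_support,)   integer class labels
        query_emb:      (n_query, d)   embeddings of unlabeled queries
        """
        classes = np.unique(support_labels)
        # One prototype per class: the mean embedding of its support examples.
        prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                               for c in classes])
        # Euclidean distance of every query to every prototype.
        dists = np.linalg.norm(query_emb[:, None, :] - prototypes[None, :, :], axis=-1)
        return classes[dists.argmin(axis=1)]

    # Toy usage: two tissue classes, three support examples each, 4-dim embeddings.
    rng = np.random.default_rng(0)
    support = np.vstack([rng.normal(0, 1, (3, 4)), rng.normal(5, 1, (3, 4))])
    labels = np.array([0, 0, 0, 1, 1, 1])
    queries = rng.normal(5, 1, (2, 4))
    print(prototype_classify(support, labels, queries))  # expected: [1 1]

In a full prototypical network, the embedding is produced by a neural network trained episodically, so that this nearest-prototype rule generalizes to previously unseen classes.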

The application »Robust AI for Digital Pathology« uses few-shot learning methods for interactive tissue classification. Because users can interact with the models, newly emerging tissue classes can also be taken into account.

In the application »AI Framework for Autonomous Systems«, meta-learning strategies for autonomous driving are being researched. Reinforcement learning models are pre-trained in a simulation environment and then adapted for real-world use via transfer learning. In another part of this project, continual learning is used to train models that can quickly and flexibly adapt to newly emerging scenarios in autonomous driving.
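The simulation-to-reality transfer described above typically follows a common pattern, sketched below in PyTorch under illustrative assumptions (the network, file name, and dimensions are made up for this example): most of the pre-trained network is frozen, and only the final layers are fine-tuned on the comparatively few real-world samples.

    import torch
    import torch.nn as nn

    # Hypothetical policy network; the project's actual architecture is not shown here.
    policy = nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),   # feature layers, pre-trained in simulation
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 4),               # action head, fine-tuned on real-world data
    )
    # policy.load_state_dict(torch.load("pretrained_in_simulation.pt"))  # illustrative path

    # Freeze the pre-trained feature layers ...
    for layer in list(policy.children())[:-1]:
        for p in layer.parameters():
            p.requires_grad = False

    # ... and fine-tune only the head on the scarce real-world samples.
    optimizer = torch.optim.Adam(
        (p for p in policy.parameters() if p.requires_grad), lr=1e-4)
    x, target = torch.randn(8, 32), torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(policy(x), target)
    loss.backward()
    optimizer.step()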

Semi-supervised learning in a time series context

Semi-supervised learning methods are applied in scenarios where a lot of training data is available but only a small fraction of it is annotated. These methods incorporate latent information from the unannotated data, such as similarities between data points, into model training in order to obtain high-performing machine learning models.

In one subproject, methods from the field of »consistency regularization« are researched and developed for application to sequential sensor data. Furthermore, semi-supervised learning strategies are used in a project on camera-based automated garbage sorting to circumvent the problem of scarce annotated training data.
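The core of consistency regularization can be stated in a few lines. The following PyTorch sketch (the model, augmentation, and data shapes are illustrative assumptions) combines a supervised loss on the few labeled sequences with a consistency loss that penalizes diverging predictions for two differently augmented views of the same unlabeled sequence:

    import torch
    import torch.nn.functional as F

    def jitter(x, sigma=0.05):
        # Illustrative augmentation for sensor sequences: additive Gaussian noise.
        return x + sigma * torch.randn_like(x)

    def semi_supervised_loss(model, x_lab, y_lab, x_unlab, lam=1.0):
        # Supervised part: standard cross-entropy on the few labeled sequences.
        sup = F.cross_entropy(model(x_lab), y_lab)
        # Consistency part: predictions for two augmented views of the same
        # unlabeled sequence should agree.
        p1 = F.softmax(model(jitter(x_unlab)), dim=-1)
        p2 = F.softmax(model(jitter(x_unlab)), dim=-1)
        return sup + lam * F.mse_loss(p1, p2)

    # Toy usage with a hypothetical classifier over windows of 100 x 3 sensor values.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(100 * 3, 5))
    x_lab, y_lab = torch.randn(4, 100, 3), torch.randint(0, 5, (4,))
    x_unlab = torch.randn(16, 100, 3)
    loss = semi_supervised_loss(model, x_lab, y_lab, x_unlab)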

Data synthesis in the field of data-driven localization

Annotating training data is technically challenging and time-consuming in the area of localization. It is therefore important to develop suitable methods to support data acquisition.
In the application »Data-Driven Localization«, a measurement platform is being developed for efficient data annotation that uses active learning and predictive uncertainty to suggest those measurement points to the user that are expected to yield the greatest information gain.
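A minimal sketch of such uncertainty-based query selection is shown below (the probability inputs and names are illustrative assumptions): candidate measurement points are ranked by the entropy of the model's predictive distribution, and the most uncertain ones are suggested next.

    import numpy as np

    def suggest_measurement_points(probs, k=5):
        """Rank candidate points by predictive entropy and return the top k.

        probs: (n_candidates, n_classes) predicted class probabilities for every
               candidate point, e.g. from an ensemble or a Bayesian model.
        """
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        return np.argsort(entropy)[::-1][:k]  # indices of the most uncertain points

    # Toy usage: four candidates; the second is most uncertain (uniform prediction).
    probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.99, 0.01]])
    print(suggest_measurement_points(probs, k=2))  # -> [1 2]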
On the other hand, generating the measurement and training data for data-driven localization also requires accounting for the statistical distribution and the non-linearities of signal propagation in the environment. Non-uniform class distributions are therefore taken into account already during the data annotation phase.
Such effects can be compensated by undersampling, where training data of the overrepresented class are discarded, or by oversampling methods such as »SMOTE« (Synthetic Minority Over-sampling Technique), where additional synthetic data points for the underrepresented class are generated.
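To illustrate, here is a minimal NumPy sketch of SMOTE's core idea (a simplification with illustrative names; in practice a library implementation such as SMOTE from the imbalanced-learn package would typically be used): a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors.

    import numpy as np

    def smote_oversample(X_min, n_new, k=3, rng=np.random.default_rng(0)):
        """Generate n_new synthetic minority samples (core idea of SMOTE).

        X_min: (n, d) feature vectors of the underrepresented class.
        """
        new_samples = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            # k nearest minority neighbors of the chosen sample (excluding itself).
            dists = np.linalg.norm(X_min - X_min[i], axis=1)
            neighbors = np.argsort(dists)[1:k + 1]
            j = rng.choice(neighbors)
            # Interpolate at a random position between sample i and neighbor j.
            lam = rng.random()
            new_samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.array(new_samples)

    X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(smote_oversample(X_min, n_new=2))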

Digitization and the steady development in artificial intelligence (AI) research, especially in the field of machine learning (ML), are currently benefiting many companies by enabling them to develop new data-driven business models or reduce process costs (e.g. in manufacturing). An important building block for this AI development is the availability of a large amount of high-quality data.

However, the data is not always available in sufficient quantity, complete, error-free, or up to date, so it cannot be used meaningfully for ML applications. Methods of data augmentation (DA for short) make it possible to significantly improve data quality and quantity. This allows ML models to be used for the first time in specialized use cases, and the results of existing ML models to be improved.

Few Data Learning is used in application areas where a very small database is available: for example, in the field of image recognition, especially in medical technology for the diagnosis of tissue anomalies, for computer vision applications in image and video production, or for forecasting and optimization applications in production and logistics.

Existing Few Data Learning methods have been developed for very specific data problems and have different objectives. Therefore, the challenge in research and application is to select, combine and further develop the right Few Data Learning methods for a specific use case.

Within the ADA Lovelace Center, the work of the Few Data Learning competence pillar is closely linked to the Few Labels Learning pillar, which focuses on the annotation of large data sets. In practice, the two problems often occur together: when data is missing, incorrect, or not available in adequate quantity, the corresponding annotations are often missing as well. Therefore, methods from the two competence pillars »Few Data Learning« and »Few Labels Learning« are often combined.


Data enhancement using data augmentation


In classical machine learning, as much training data as possible is needed so that the model learns to solve the corresponding classification or regression task and delivers good results on unseen test data in the evaluation. Few Data Learning, in contrast, refers to a set of machine learning methods, with roots in statistics, that are designed for cases where the database is very small.

Data augmentation is used to expand the database. Various methods are available for this: for example, the few existing data points can be slightly modified, or new data points can be generated. Which augmentation methods are suitable depends on the type of data and the problem; usually, several methods are used in combination.

The focus of research within the Few Data Learning competency pillar is on:

  • Exploitation of similarities in low-dimensional data sets (e.g., interpolation, imputation, clustering) to fill in missing data (see the sketch after this list)
  • Generation of synthetic data by redundancy reduction in high-dimensional data sets (e.g. autoencoders, PCA, dynamic factor models)
  • Simulation of processes and data (e.g. in production with AnyLogic as well as SimPlan or by physical models)
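As a minimal example of the first item on the list, missing values in a low-dimensional demand series can be filled in from their neighbors by interpolation (the pandas call below is standard; the series itself is made up):

    import numpy as np
    import pandas as pd

    # Demand series with gaps (NaN = missing observation).
    demand = pd.Series([120.0, np.nan, 130.0, 128.0, np.nan, np.nan, 140.0])
    filled = demand.interpolate(method="linear")  # fill each gap from its neighbors
    print(filled.tolist())  # [120.0, 125.0, 130.0, 128.0, 132.0, 136.0, 140.0]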

The difference between similarity-based and redundancy-based DA methods can be illustrated as follows:

If a time series (here, demand in units for a product) is to be extended by a few data points for ML modeling, one can search for similar time series that already exist and take over their data points directly to extend the short series. The search for similar data is done, for example, using clustering methods.
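A minimal sketch of this similarity-based approach (all names and data are illustrative; real projects would group series with clustering methods first):

    import numpy as np

    def extend_from_similar(short_series, candidates):
        """Extend a short time series with values from the most similar long series.

        short_series: (t,)    e.g. demand history of a new product
        candidates:   (n, T)  long series of existing products, T > t
        """
        t = len(short_series)
        # Similarity over the overlapping window (plain Euclidean distance here).
        dists = np.linalg.norm(candidates[:, :t] - short_series, axis=1)
        best = dists.argmin()
        # Take over the continuation of the most similar series directly.
        return np.concatenate([short_series, candidates[best, t:]])

    short = np.array([10.0, 12.0, 11.0])
    candidates = np.array([[10.0, 12.0, 11.5, 13.0, 14.0],
                           [30.0, 28.0, 29.0, 27.0, 26.0]])
    print(extend_from_similar(short, candidates))  # -> [10. 12. 11. 13. 14.]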

If such a direct assignment to similar time series is not possible, difficult, or not desired, a synthetic data series that represents the information of the many available series can be generated instead. In this case, redundancies in a high-dimensional data set are exploited to generate a low-dimensional representation of the entire data set.
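As a minimal sketch of this redundancy-based approach, the first principal component of many correlated series can serve as a single synthetic representative (the data below is simulated; scikit-learn's PCA is an illustrative choice, and autoencoders or dynamic factor models work analogously):

    import numpy as np
    from sklearn.decomposition import PCA

    # 20 correlated demand series of length 50 (rows = series).
    rng = np.random.default_rng(0)
    common = np.cumsum(rng.normal(0, 1, 50))        # shared underlying trend
    series = common + rng.normal(0, 0.3, (20, 50))  # noisy variants of the trend

    # Compress the high-dimensional set to one synthetic representative series:
    # the first principal component over the time dimension.
    pca = PCA(n_components=1)
    scores = pca.fit_transform(series.T)  # (50, 1): one value per time step
    synthetic = scores[:, 0]              # low-dimensional representation
    print(synthetic[:5])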

In some applications, there may even be no data at all for analysis. In this case, simulation models can be used to generate completely synthetic data. For example, in production: if an existing production line is to be converted to a completely new product, no empirical values or data exist for it yet. A simulation environment can then be created on the basis of the existing data and process modeling, which simulates data for the new products that is subsequently analyzed using ML models. It is particularly important, but also difficult, to simulate data that is as realistic as possible and neither overestimates nor underestimates certain characteristics (e.g., production errors or machine downtimes), so that the ML model also delivers good results on real data during operation.
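A toy sketch of such simulation-based data generation is given below; all parameters are invented for illustration, whereas real projects would derive them from process modeling, for example with tools such as AnyLogic or SimPlan:

    import numpy as np

    def simulate_production(hours=1000, rate=50, downtime_prob=0.02,
                            error_rate=0.01, rng=np.random.default_rng(0)):
        """Simulate hourly output of a production line, including downtimes.

        All parameters are illustrative; realistic values would come from
        process modeling of the actual plant.
        """
        records = []
        for h in range(hours):
            down = rng.random() < downtime_prob           # random machine downtime
            produced = 0 if down else rng.poisson(rate)   # units produced this hour
            defects = rng.binomial(produced, error_rate)  # simulated production errors
            records.append((h, produced, defects, down))
        return records

    data = simulate_production(hours=5)
    print(data)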

Data enhancement in digital pathology

In the application »Robust AI for Digital Pathology«, AI methods are being developed to automatically detect colorectal cancer in tissue scans. Often, only very little data is available for this. Therefore, within the former competence pillar Few Data Learning, different methods of data augmentation and different convolutional network architectures were compared on a multi-scanner database with regard to their execution speed and robustness. The goal is to ensure robust tissue analysis in intestinal sections of adenocarcinomas; robustness is achieved by applying data augmentation, especially color augmentation.


Generation of an optimized database for diagnostics in wireless systems

In the application »AI-based diagnoses in wireless systems«, a toolchain for the automated detection and prediction of transmission faults in wireless networks is being developed, contributing to better fault analysis tools for wireless networks based on spectrum analyzers. Using machine-learning-based image processing algorithms, individual frames of different wireless technologies, as well as interference collisions between frames, are detected in real time and classified according to their communication standard. To improve the database, the wireless signals generated with a vector signal generator were further processed and recombined in a specially developed simulation pipeline. In this way, an extensive labeled training data set could be obtained.
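The recombination step might look like the following NumPy sketch (frame shapes and all names are assumptions; real frames would come from the vector signal generator): two recorded frames are overlaid at random time offsets to synthesize a labeled collision example.

    import numpy as np

    def synthesize_collision(frame_a, frame_b, length=2048,
                             rng=np.random.default_rng(0)):
        """Overlay two recorded signal frames at random offsets to create a
        labeled collision example (complex baseband samples assumed)."""
        mix = np.zeros(length, dtype=complex)
        off_a = rng.integers(0, length - len(frame_a))
        off_b = rng.integers(0, length - len(frame_b))
        mix[off_a:off_a + len(frame_a)] += frame_a
        mix[off_b:off_b + len(frame_b)] += frame_b
        # The offsets double as labels for where each frame (and any overlap) lies.
        return mix, (off_a, off_b)

    # Toy frames standing in for generator output of two wireless standards.
    frame_a = np.exp(2j * np.pi * 0.10 * np.arange(256))
    frame_b = np.exp(2j * np.pi * 0.25 * np.arange(512))
    mix, labels = synthesize_collision(frame_a, frame_b)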



Clustering method for spare parts forecasting

In the application »Self-optimization in adaptive logistics networks«, a method was developed that uses clustering to detect similarities in a large data set of incomplete spare-parts consumption data, and that uses consumption data available over a longer time horizon to forecast demand for new spare parts without a long data history.
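The underlying idea can be sketched with scikit-learn's k-means (all data and names below are illustrative): long consumption histories are clustered, a new part with a short history is assigned to the nearest cluster over the overlapping months, and the cluster mean provides its forecast.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Long consumption histories of established spare parts (36 months each).
    histories = np.vstack([rng.poisson(5, (10, 36)), rng.poisson(20, (10, 36))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(histories)

    # New spare part with only 6 months of data: assign it to the nearest
    # cluster based on the overlapping window, then forecast with the cluster mean.
    new_part = rng.poisson(19, 6)
    centroid_windows = km.cluster_centers_[:, :6]
    cluster = np.linalg.norm(centroid_windows - new_part, axis=1).argmin()
    forecast = km.cluster_centers_[cluster, 6:]  # forecast for months 7-36
    print(cluster, forecast[:5])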

»ADA wants to know« Podcast

In our »ADA wants to know« podcast series, the people responsible for the competence pillars are in conversation with ADA and provide insight into their research focuses, challenges, and methods. Here are two episodes in which you can listen to ADA with Few Labels Learning expert Jann Goschenhofer or with Few Data Learning expert Dr. Christian Menden.

Few Labels Learning with ADA and Jann Goschenhofer

Few Data Learning with Dr. Christian Menden

Our focus areas within AI research

Our work at the ADA Lovelace Center is aimed at developing the following methods and procedures in nine domains of artificial intelligence from an applied perspective.


Automated Learning covers a large area starting with the automation of feature detection and selection for given datasets as well as model search and optimization, continuing with their automated evaluation, and ending with the adaptive adjustment of models through training data and system feedback.




Sequence-based Learning concerns itself with the temporal and causal relationships found in data in applications such as language processing, event processing, biosequence analysis, or multimedia files. Observed events are used to determine the system’s current status, and to predict future conditions. This is possible both in cases where only the sequence in which the events occurred is known, and when they are labelled with exact time stamps.


Learning from Experience refers to methods whereby a system is able to optimize itself by interacting with its environment and evaluating the feedback it receives, or dynamically adjusting to changing environmental conditions. Examples include automatic generation of models for evaluation and optimization of business processes, transport flows, or control systems for robots in industrial production.


Data-centric AI (DCAI) offers a new perspective on AI modeling that shifts the focus from model building to the curation of high-quality annotated training datasets, because in many AI projects, that is where the leverage for model performance lies. DCAI offers methods such as model-based annotation error detection, design of consistent multi-rater annotation systems for efficient data annotation, use of weak and semi-supervised learning methods to exploit unannotated data, and human-in-the-loop approaches to improve models and data.


To ensure safe and appropriate adoption of artificial intelligence in fields such as medical decision-making and quality control in manufacturing, it is crucial that the machine learning model is comprehensible to its users. An essential factor in building transparency and trust is to understand the rationale behind the model's decision making and its predictions. The ADA Lovelace Center is conducting research on methods to create comprehensible and trustworthy AI systems in the competence pillar of Trustworthy AI, contributing to human-centered AI for users in business, academia, and society.


Process-aware Learning is the link between process mining, the data-based analysis and modeling of processes, and machine learning. The focus is on predicting process flows, process metrics, and process anomalies. This is made possible by extracting process knowledge from event logs and transferring it into explainable prediction models. In this way, influencing factors can be identified and predictive process improvement options can be defined.

Mathematical optimization plays a crucial role in model-based decision support, providing planning solutions in areas as diverse as logistics, energy systems, mobility, finance, and building infrastructure, to name but a few examples. The Center is expanding its already extensive expertise in a number of promising areas, in particular real-time planning and control.


The task of semantics is to describe data and data structures in a formally defined, standardized, consistent and unambiguous manner. For the purposes of Industry 4.0, numerous entities (such as sensors, products, machines, or transport systems) must be able to interpret the properties, capabilities or conditions of other entities in the value chain.

Tiny Machine Learning (TinyML) brings AI even to microcontrollers. It enables low-latency inference on edge devices that typically consume only a few milliwatts of power. To achieve this, Fraunhofer IIS is conducting research on multi-objective optimization for efficient design space exploration and on advanced compression techniques. Furthermore, hierarchical and informed machine learning, efficient model architectures, and genetic AI pipeline composition are explored in our research. In this way, we enable our partners' intelligent products.


Hardware-aware Machine Learning (HW-aware ML) focuses on algorithms, methods and tools to design, train and deploy HW-specific ML models. This includes a wide range of techniques to increase energy efficiency and robustness against HW faults, e.g. robust training for quantized DNN models using Quantization- and Fault-aware Training, and optimized mapping and deployment to specialized (e.g. neuromorphic) hardware. At Fraunhofer IIS, we complement this with extensive research in the field of Spiking Neural Network training, optimization, and deployment.

Other topics of interest

Optimized domain adaptation

This white paper deals with optimizing machine learning (ML) models when using data from similar domains. A new two-step domain adaptation approach is introduced.

Active Learning

Active learning favors labeling the most informative data samples. The performance of active learning heuristics, however, depends on both the underlying model architecture and the data. In this white paper, learn about a policy that reflects the best decisions of multiple expert heuristics given the current state of the active learning process, and that learns to select samples in a complementary way that unifies the expert strategies.

What the ADA Lovelace Center offers you


The ADA Lovelace Center for Analytics, Data and Applications, together with its cooperation partners, offers continuing education programs on concepts, methods, and concrete applications in the field of data analytics and AI.

Seminars with the following focus topics are offered:

Data Analytics for the Supply Chain

Get an overview of common analytics methods and their use cases in supply chain management (German white paper).