Few Data Learning

Using machine learning methods despite a sparse data basis

Digitization and the steady development in artificial intelligence (AI) research, especially in the field of machine learning (ML), are currently benefiting many companies by enabling them to develop new data-driven business models or reduce process costs (e.g. in manufacturing). An important building block for this AI development is the availability of a large amount of high-quality data.

However, data is not always available in sufficient quantity, or it is incomplete, erroneous, or outdated, so that it cannot be used meaningfully for ML applications. Data augmentation (DA) methods can significantly improve both data quality and quantity. This makes it possible to use ML models in special use cases for the first time and to optimize the results of existing ML models.

Few Data Learning is used in application areas where only a very small data basis is available: for example, in the field of image recognition, especially in medical technology for the diagnosis of tissue anomalies, for computer vision applications in image and video production, or for forecasting and optimization applications in production and logistics.

Existing Few Data Learning methods have been developed for very specific data problems and have different objectives. Therefore, the challenge in research and application is to select, combine and further develop the right Few Data Learning methods for a specific use case.

Within the ADA Lovelace Center, the work of the Few Data Learning competency pillar is closely linked to the Few Labels Learning pillar, which focuses on the annotation of large data sets. In practice, the two problems often occur together: when data is missing, incorrect, or not available in adequate quantity, the corresponding annotations are often missing as well. Methods from the two competency pillars »Few Data Learning« and »Few Labels Learning« are therefore often combined.


Data enhancement using data augmentation

In classical machine learning, as much training data as possible is needed so that the model learns to solve the corresponding classification or regression task and delivers good results on unseen test data during evaluation. Few Data Learning, in contrast, refers to a set of machine learning methods, with origins in statistics, that work with a very small data basis.

Data augmentation is used to expand the data basis. Various methods are available for this: for example, the few existing data points can be slightly modified, or new data points can be generated. Which augmentation methods are suitable depends on the type of data and the problem; usually, several methods are used in combination.

The focus of research within the Few Data Learning competency pillar is on:

  • Exploitation of similarities in low-dimensional data sets (e.g. interpolation, imputation, clustering) to fill in missing data; see the sketch after this list
  • Generation of synthetic data by redundancy reduction in high-dimensional data sets (e.g. autoencoders, PCA, dynamic factor models)
  • Simulation of processes and data (e.g. in production with AnyLogic as well as SimPlan, or by physical models)
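
As a simple illustration of the first point, the following sketch fills gaps in a small data set by k-nearest-neighbor imputation. The toy data and the choice of scikit-learn's KNNImputer are illustrative assumptions, not the specific implementation used in the competency pillar.

    # A minimal sketch of similarity-based gap filling via k-nearest-neighbor
    # imputation; the toy data and parameters are illustrative assumptions.
    import numpy as np
    from sklearn.impute import KNNImputer

    # Toy data set: rows are observations, columns are features; NaN marks gaps.
    X = np.array([
        [1.0, 2.1, np.nan],
        [0.9, 2.0, 3.1],
        [1.1, np.nan, 2.9],
        [5.0, 7.8, 9.2],
    ])

    # Each missing value is replaced by the mean of that feature over the
    # k most similar rows (nan-aware Euclidean distance on observed features).
    imputer = KNNImputer(n_neighbors=2)
    X_filled = imputer.fit_transform(X)
    print(X_filled)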

The difference between similarity-based and redundancy-based DA methods is illustrated in the following graphic:

Data enhancement using data augmentation
© Fraunhofer IIS

If a time series (here, the demand in units for a product) consisting of only a few data points is to be extended for ML modeling, one can search for similar time series that already exist. Data points from these similar series are then adopted directly to extend the short time series. The search for similar data can be performed, for example, with clustering methods.
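
A minimal sketch of this similarity-based approach, under simplified assumptions: a pool of synthetic demand series is clustered on the weeks that a short series shares with them, and earlier weeks of the matching cluster are borrowed to extend the short series. All data, lengths, and the use of k-means are illustrative, not the pillar's specific method.

    # Similarity-based extension of a short demand time series (illustrative).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    pool = rng.poisson(lam=20, size=(50, 104))       # 50 products, 104 weeks each
    short = pool[0, -12:] + rng.integers(-2, 3, 12)  # new product, 12 weeks only

    # Cluster the pool on the 12 weeks that overlap with the short series.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pool[:, -12:])
    label = kmeans.predict(short.reshape(1, -1))[0]

    # Borrow the cluster mean of the earlier weeks to extend the short history.
    history = pool[kmeans.labels_ == label, :-12].mean(axis=0)
    extended = np.concatenate([history, short])      # now 104 weeks long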

If such a direct assignment to similar time series is not possible, difficult, or not desired, a synthetic data series that represents the information of the many available series can be generated instead. In this case, redundancies in a high-dimensional data set are exploited to generate a low-dimensional representation of the entire data set.
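
The redundancy-based variant can be sketched with PCA, assuming a set of related time series: the leading principal component over all series serves as a synthetic series carrying their shared information. The data and the choice of PCA (instead of, say, an autoencoder or a dynamic factor model) are illustrative.

    # Redundancy-based generation of one synthetic representative series.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    t = np.arange(104)
    base = 20 + 5 * np.sin(2 * np.pi * t / 52)        # shared seasonal pattern
    series = base + rng.normal(0, 2, size=(30, 104))  # 30 noisy variants

    # Rows = time steps, columns = series; the first principal component
    # captures the common low-dimensional structure across all series.
    pca = PCA(n_components=1)
    synthetic = pca.fit_transform(series.T).ravel()   # one series of length 104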

In some applications, there may even be no data at all available for analysis. In this case, simulation models can be used to generate completely synthetic data. For example, in production: if an existing production line is to be converted to a completely new product, no empirical values or data exist for it yet. A simulation environment can then be created on the basis of the existing data and process modeling, which simulates data for the new products; this data is subsequently analyzed with ML models. It is particularly important, but also difficult, to simulate data that is as realistic as possible and neither overestimates nor underestimates certain characteristics (e.g. production errors or machine downtimes), so that the ML model also delivers good results on real data during operation.
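
A minimal sketch of generating fully synthetic production data by simulation, in the spirit described above: one shift of a machine is simulated part by part, including scrap and occasional downtime. The rates and distributions are illustrative assumptions, not calibrated values from a real production line.

    # Fully synthetic production data from a simple stochastic simulation.
    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_shift(n_parts=500, scrap_rate=0.03, downtime_prob=0.01):
        """Simulate one shift; return (part id, finish time, scrapped) records."""
        records, clock = [], 0.0
        for part in range(n_parts):
            if rng.random() < downtime_prob:          # occasional machine stop
                clock += rng.exponential(scale=30.0)  # repair time in minutes
            clock += max(rng.normal(loc=1.2, scale=0.1), 0.0)  # cycle time
            records.append((part, clock, rng.random() < scrap_rate))
        return records

    data = simulate_shift()
    print(f"simulated {len(data)} parts, {sum(r[2] for r in data)} scrapped")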

Data enhancement in digital pathology

In the application Robust AI for Digital Pathology, AI methods are being developed to automatically detect colorectal cancer on digitized tissue scans. Often, only very little data is available for this. Therefore, within the Few Data Learning competency pillar, different data augmentation methods were compared on a multi-scanner database. The goal is to ensure robust tissue analysis in intestinal sections of adenocarcinomas. Different convolutional network architectures were compared with regard to their execution speed and robustness, where robustness is achieved by applying data augmentation, especially color augmentation.
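
Color augmentation of this kind can be sketched with torchvision, assuming RGB tissue patches as input images. The jitter ranges below are illustrative assumptions; the project's actual augmentation policy is not reproduced here.

    # Color augmentation to mimic color variation between slide scanners.
    from torchvision import transforms

    color_augment = transforms.Compose([
        # Randomly perturb brightness, contrast, saturation, and hue.
        transforms.ColorJitter(brightness=0.2, contrast=0.2,
                               saturation=0.2, hue=0.05),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # patch = PIL.Image.open("tissue_patch.png")  # hypothetical input patch
    # tensor = color_augment(patch)               # randomly augmented tensor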

Generation of an optimized data basis for diagnostics in wireless systems

In the application AI-based diagnoses in wireless systems, a toolchain for the automated detection and prediction of transmission faults in wireless networks is being developed. This contributes to improving fault analysis tools for wireless networks that build on spectrum analyzers. Using machine-learning-based image processing algorithms, individual frames of different wireless technologies, as well as interference collisions between frames, are detected in real time and classified according to their communication standard. To improve the data basis, wireless signals generated with a vector signal generator were further processed and recombined in a specially developed simulation pipeline. In this way, an extensive labeled training data set could be created.
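
The recombination step can be sketched as follows, assuming baseband IQ snippets per technology are already available (e.g. from the vector signal generator): frames are placed at random offsets in a noisy capture, which can produce the frame collisions mentioned above, and the placements yield the labels. All names, shapes, and parameters are illustrative.

    # Recombining signal frames into labeled synthetic captures (illustrative).
    import numpy as np

    rng = np.random.default_rng(3)

    def compose_capture(frames, length=100_000, snr_db=15):
        """Place frames at random offsets in noise; return IQ samples + labels."""
        noise = rng.normal(size=length) + 1j * rng.normal(size=length)
        capture = noise * 10 ** (-snr_db / 20) / np.sqrt(2)
        labels = []
        for tech, iq in frames:
            start = int(rng.integers(0, length - len(iq)))
            capture[start:start + len(iq)] += iq   # overlaps create collisions
            labels.append((tech, start, start + len(iq)))
        return capture, labels

    # Hypothetical frame snippets: (technology name, IQ samples).
    frames = [("wifi", np.ones(2_000, dtype=complex)),
              ("ble", 0.5 * np.ones(500, dtype=complex))]
    capture, labels = compose_capture(frames)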


Clustering method for spare parts forecasting

In the application Self-optimization in adaptive logistics networks, a method was developed that uses clustering to detect similarities in a large data set of incomplete spare-parts consumption data and draws on consumption data that is available over a longer time horizon to forecast demand for new spare parts (which lack a long data history).
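
A minimal sketch of this idea, assuming monthly consumption matrices in which new parts have only short histories: parts are clustered on the months they share, and the new part is forecast from the longer history of its cluster. The numbers and the simple scaling heuristic are illustrative assumptions, not the developed method.

    # Cluster-based forecasting for a spare part with a short history.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(4)
    old = rng.poisson(5, size=(40, 36))   # 40 parts, 36 months of history
    new = rng.poisson(5, size=12)         # new part, last 12 months only

    # Cluster established parts on the 12 months shared with the new part.
    labels = AgglomerativeClustering(n_clusters=4).fit_predict(old[:, -12:])
    centroids = np.vstack([old[labels == k, -12:].mean(axis=0) for k in range(4)])
    k = int(np.argmin(np.linalg.norm(centroids - new, axis=1)))

    # Forecast: the cluster's long-run mean, scaled to the new part's level.
    scale = new.mean() / old[labels == k, -12:].mean()
    forecast = old[labels == k].mean() * scale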

Our focus areas within AI research

Our work at the ADA Lovelace Center is aimed at developing the following methods and procedures in nine domains of artificial intelligence from an applied perspective.

Automatic learning
© Fraunhofer IIS

Automatic learning covers a vast field, ranging from automated feature recognition and selection for data sets, through model search, optimization, and automated evaluation of these processes, to adaptive model adjustment using training data and system feedback. It plays a key role in areas such as assistance systems for data-driven decision support.

Sequence-based learning
© Fraunhofer IIS

Sequence-based learning concerns itself with the temporal and causal relationships found in data in applications such as language processing, event processing, biosequence analysis, or multimedia files. Observed events are used to determine the system’s current status, and to predict future conditions. This is possible both in cases where only the sequence in which the events occurred is known, and when they are labelled with exact time stamps.

Experience-based learning
© Fraunhofer IIS

Experience-based learning refers to methods whereby a system is able to optimize itself by interacting with its environment and evaluating the feedback it receives, or dynamically adjusting to changing environmental conditions. Examples include automatic generation of models for evaluation and optimization of business processes, transport flows, or control systems for robots in industrial production.

Few Labels Learning
© Fraunhofer IIS

Major breakthroughs in AI involving tasks such as language recognition, object recognition or machine translation can be attributed in part to the availability of vast annotated datasets. Yet in many real-life scenarios, particularly in industry, such datasets are much more limited. We therefore conduct research on learning using small annotated datasets in the context of techniques for unsupervised, semi-supervised and transfer learning.

For several years, we have seen unbridled growth in the volume of digital data in existence, giving rise to the field of big data. When this data is used to generate knowledge, there is a need to explain the ensuing results and forecasts to users in a plausible and transparent manner. At the ADA Lovelace Center, this issue is explored under the heading of explainable learning, with the goal of boosting acceptance for artificial intelligence among users in industry, research and society at large.

Mathematical optimization plays a crucial role in model-based decision support, providing planning solutions in areas as diverse as logistics, energy systems, mobility, finance, and building infrastructure, to name but a few examples. The Center is expanding its already extensive expertise in a number of promising areas, in particular real-time planning and control.

The task of semantics is to describe data and data structures in a formally defined, standardized, consistent and unambiguous manner. For the purposes of Industry 4.0, numerous entities (such as sensors, products, machines, or transport systems) must be able to interpret the properties, capabilities or conditions of other entities in the value chain.

We use Few Data Learning to address key research issues involved in processing and augmenting data, or generating sufficient datasets, for instance in AI applications using material master data in industry. This includes processing flawed datasets and using simulation techniques to generate missing data.

Other topics of interest

Data Analytics for the Supply Chain

Get an overview of common analytics methods and their use cases in supply chain management (German whitepaper).