Objective 1: Our first objective is the creation of a scalable theoretical foundation and software implementation for training from synthetic data in situations where data availability is an issue. In this approach, training data for neural networks is generated using possibly large-scale simulations based on parametric models. Following this concept consistently leads to the idea of online training: rather than organizing the training process in epochs, data is generated in a continuous stream during training. Decisions on what data to generate are made on the fly, based on automated investigations of the training process. The main expected outcomes are: (1) publications on the characterization of the parameter space using sensitivity analysis and image-based metrics, (2) publications on novel algorithms for sampling the parameter space, and (3) a scalable software infrastructure implementing the online-training approach.
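The stream-based training loop described above can be illustrated with a minimal sketch. All names here are illustrative assumptions, not the project's actual software: a toy parametric "simulator" stands in for the large-scale simulations, a single-weight model stands in for the neural network, and on-the-fly sampling decisions are made by preferring the parameter region where the running loss is currently highest.

```python
import random

# Hypothetical parametric simulator: maps a parameter p to a noisy observation.
def simulate(p):
    return 2.0 * p + random.gauss(0.0, 0.05)

# Online training: instead of fixed epochs over a stored dataset, simulation
# parameters are drawn in a continuous stream and the model is updated per sample.
def online_train(steps=2000, lr=0.01, seed=0):
    random.seed(seed)
    w = 0.0  # single-weight linear model y = w * p
    # Adaptive-sampling sketch: track a running loss per parameter region and
    # preferentially generate data where the model currently performs worst.
    regions = [(0.0, 1.0), (1.0, 2.0)]
    region_loss = [1.0, 1.0]
    for _ in range(steps):
        i = max(range(len(regions)), key=lambda k: region_loss[k])
        lo, hi = regions[i]
        p = random.uniform(lo, hi)
        y = simulate(p)            # data generated on the fly, never stored
        err = w * p - y
        w -= lr * err * p          # SGD step on the squared error
        region_loss[i] = 0.9 * region_loss[i] + 0.1 * err * err
    return w

w = online_train()  # converges toward the simulator's true slope of 2.0
```

In a real instance, the running-loss heuristic would be replaced by the sensitivity analysis and image-based metrics mentioned above, and data generation would run in parallel with training.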

Objective 2: Our second objective is to investigate various deployment strategies for complex AI
workflows (e.g., potentially combining online training, simulations and inference, all in parallel
and in real-time) on hybrid execution infrastructures (e.g., combining supercomputers and
cloud/fog/edge systems). This requires scalable and reliable experimentation tools. To this end, we will propose methodologies and supporting tools enabling researchers to: (1)
describe in a representative way the application behavior, (2) reproduce it in a reliable, controlled
environment for extensive experiments, and (3) understand how the end-to-end performance of
applications is correlated to various algorithm-dependent or infrastructure-dependent factors.
The main expected outcomes are: (1) publications describing an experimental, reproducibility-oriented methodology and its validation in practice through the novel insights it can enable, and (2)
an associated underlying software framework for experiment deployment, monitoring, and
execution at scale on various relevant scalable infrastructures.
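The three-step methodology above can be sketched in miniature. Everything here is an assumption for illustration: an experiment descriptor captures the factors needed to reproduce a run, a stand-in `run` function plays the role of deploying the workload on an infrastructure, and a seeded campaign shows how controlled repetition lets end-to-end performance be correlated with a factor (here, node count).

```python
import random
import statistics
from dataclasses import dataclass

# (1) Describe the application behavior: a hypothetical, minimal descriptor
# holding the factors needed to reproduce a run (names are illustrative).
@dataclass(frozen=True)
class Experiment:
    workload: str
    nodes: int
    seed: int

# (2) Reproduce it in a controlled environment: a stand-in for deployment.
# A real tool would launch the workload and collect monitoring data; the
# fixed seed makes each trial exactly repeatable.
def run(exp):
    rng = random.Random(exp.seed)
    base = 100.0 / exp.nodes             # idealized parallel speedup
    return base + rng.uniform(0.0, 1.0)  # plus infrastructure jitter

# (3) Correlate end-to-end performance with an infrastructure factor by
# sweeping it over repeated, seeded trials.
def campaign(nodes_list, repeats=5):
    results = {}
    for n in nodes_list:
        times = [run(Experiment("train", n, seed=s)) for s in range(repeats)]
        results[n] = (statistics.mean(times), statistics.stdev(times))
    return results

results = campaign([1, 2, 4])  # mean runtime drops as node count grows
```

The design point is that the descriptor, not the run itself, is the unit of record: any trial can be replayed from it, which is what makes the campaign's statistics trustworthy.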

Objective 3: Our third objective (WP3) is to design a set of methodological and algorithmic tools for
memory management (both for training and inference) and for the efficient use of heterogeneous
computation resources (especially for inference). The goal is to combine the various techniques
developed by each of the teams (compilation and optimization of kernels, dynamic runtime
scheduling, offloading, re-materialization, model parallelism) in order to propose an efficient
overall solution. The question of the target framework (PyTorch, TensorFlow, Horovod) is still
open and the framework will be chosen according to the results obtained and according to the
opportunities and the constraints of the different existing frameworks. The main expected outcomes are: (1) publications on the proposed methodology and (2) suitable algorithmic tools.
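Of the techniques listed above, re-materialization is the easiest to sketch in isolation. The following is a minimal, framework-free illustration (the project's actual tools would operate inside PyTorch or TensorFlow): for a chain of layers, only every k-th activation is kept during the forward pass, and any discarded activation is recomputed on demand from the nearest earlier checkpoint, trading extra compute for a smaller memory footprint.

```python
# Forward pass that keeps only every k-th activation as a checkpoint.
def forward_with_checkpoints(layers, x, k):
    saved = {0: x}
    for i, f in enumerate(layers, start=1):
        x = f(x)
        if i % k == 0:
            saved[i] = x   # only these activations stay in memory
    return x, saved

# Re-materialization: recompute activation i from the nearest checkpoint,
# as a backward pass would when it needs a discarded intermediate value.
def rematerialize(layers, saved, i):
    j = max(s for s in saved if s <= i)
    x = saved[j]
    for f in layers[j:i]:
        x = f(x)
    return x

# Toy "layers": eight functions that each add a constant.
layers = [lambda v, a=a: v + a for a in range(1, 9)]
out, saved = forward_with_checkpoints(layers, 0, k=4)  # keeps 3 of 9 activations
mid = rematerialize(layers, saved, 6)                  # recomputed, not stored
```

Combining this with offloading (moving checkpoints to slower memory), kernel optimization, and model parallelism is precisely the scheduling problem the objective targets.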
