ENGAGE will create the foundations for a new generation of high-performance computing (HPC) environments for Artificial Intelligence (AI) workloads. The basic premise is that, in the future, training data for Deep Neural Networks (DNNs) will no longer only be stored and processed in epochs, but will instead be generated on the fly using parametric models and simulations. This is particularly useful where obtaining data by other means is expensive or difficult, or where a phenomenon has been predicted in theory but not yet observed. One key application of this approach is the validation and certification of AI systems through targeted testing with synthetically generated data from simulations. We will make contributions on three levels:
- On the application level, we will address the question of how adaptive sampling of parameter spaces allows for better choices about which data to generate.
- On the middleware level, we will address the question of how virtualization and scheduling must be adapted to facilitate and optimize the execution of the resulting mixed workloads, consisting of training and simulation tasks running on potentially hybrid (HPC/cloud/edge) infrastructures.
- On the resource management level, we will contribute novel strategies for optimizing memory management and the dynamic choice of parallel resources for running the training and inference phases.

In summary, our project will create a blueprint for a new generation of AI compute infrastructures that goes beyond the concept of epoch-based data management and establishes model-based online training of Neural Networks as the new paradigm for DNN applications.
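The interplay of the levels above can be illustrated with a minimal, purely hypothetical sketch: a toy parametric "simulation" generates labeled samples on the fly, and the parameter space is sampled adaptively so that more data is generated in regions where the model still performs poorly. All names (`simulate`, `train_online`, the linear model standing in for a DNN, the region-based sampling heuristic) are illustrative assumptions, not part of the project's actual design.

```python
import random

def simulate(theta):
    """Hypothetical parametric model: produces one (input, target) pair
    for parameter theta, instead of reading it from a stored epoch."""
    x = theta
    y = 3.0 * x + 1.0  # ground-truth process the model should learn
    return x, y

def train_online(steps=3000, lr=0.1, regions=4, lo=0.0, hi=1.0):
    """Online training with a simple adaptive sampling heuristic.

    The parameter range [lo, hi) is split into regions; regions whose
    recent training loss is high are sampled more often, so the
    simulation generates data where the model is still weak.
    """
    w, b = 0.0, 0.0                  # toy linear model standing in for a DNN
    region_loss = [1.0] * regions    # optimistic init: explore everywhere
    width = (hi - lo) / regions
    for _ in range(steps):
        # adaptive choice: pick a region proportional to its running loss
        r = random.choices(range(regions), weights=region_loss)[0]
        theta = lo + (r + random.random()) * width
        x, y = simulate(theta)       # data generated on the fly
        err = (w * x + b) - y
        w -= lr * err * x            # SGD step on the squared error
        b -= lr * err
        # exponential moving average of the per-region loss
        region_loss[r] = 0.9 * region_loss[r] + 0.1 * err * err
    return w, b

random.seed(0)
w, b = train_online()  # w approaches 3.0, b approaches 1.0
```

In a real ENGAGE-style deployment, the simulation and training steps would be separate tasks placed by the middleware on hybrid infrastructure; here they run in one loop only to keep the sketch self-contained.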