

Artificial Intelligence has established itself in high-performance computing and provides the sciences with a plethora of new methods for analysing data. That is why we operate the LRZ AI Systems, host partners’ AI platforms and evaluate innovative AI chips for their potential use in research.
From the natural sciences to the humanities: Artificial Intelligence (AI) is now in constant interdisciplinary use. Its methods can be used to comprehensively analyse and recalculate Big Data. AI also complements the established methods of high-performance computing and extends conventional simulations with statistical analyses where mathematics and traditional programming reach their limits. With the LRZ AI Systems, the LRZ provides researchers with powerful clusters equipped with Graphics Processing Units (GPUs).
Strong partnerships create even more opportunities: BayernKI, the Bavarian IT infrastructure for AI, and the European AI factory HammerHAI provide additional AI resources and supplement the range of training courses and workshops. Last but by no means least, partnerships with the Munich Centre for Machine Learning (MCML) and the German Aerospace Centre (DLR) are expanding the LRZ’s range of AI services as well as its experience with AI.
The new technology is developing rapidly, but the energy requirements of AI are very high and AI models are often unreliable. This raises research questions to which the LRZ is seeking answers in collaboration with renowned international institutes.
Analysing measured values, developing models, calculating simulations: if research can process its observations efficiently, knowledge grows more quickly and the challenges of the present can be tackled more effectively. That is why we do more than just provide scientists and researchers with high-performance technology and innovative tools to shorten the path from data to insight. Above all, we focus on personal support and a comprehensive, up-to-date training programme: at the LRZ, you can learn programming languages as well as how to use high-performance and AI systems. And it goes without saying that LRZ specialists continuously optimise the available computing power, the infrastructure for data management and the usability of technology and tools.
The LRZ AI Systems are made up of various high-performance cluster segments equipped with Graphics Processing Units (GPUs), for example from NVIDIA. The LRZ provides a practical software stack for developing your own models, as well as tried and tested data sets and common AI applications.
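As an illustration of working on such a GPU-equipped cluster segment, the minimal sketch below checks for an available NVIDIA GPU and runs a single training step with PyTorch. The model, data and hyperparameters are placeholders for this example and are not part of the LRZ software stack itself.

```python
# Minimal sketch: run one training step on a GPU if present, otherwise on the CPU.
# Model, data and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in batch; real workloads would stream data from storage.
x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(f"device={device}, loss={loss.item():.4f}")
```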
Cerebras Systems has developed a chip specially designed for working with and training large language models: the Wafer Scale Engine 2 (WSE-2). It is the size of a pizza box and contains 2.6 trillion transistors, 850,000 computing cores and around 40 gigabytes of fast, static on-chip memory. When working with Large Language Models (LLMs), data can flow here and results can be temporarily stored and recalculated. Together with researchers, the LRZ is evaluating a CS-2 system with the WSE-2 for its potential use in science.
Not only does AI need powerful computers, it also needs flexible storage solutions. Regardless of whether we’re talking about computer vision or language models, anyone who works with AI needs to prepare AI systems for their tasks. Training runs iteratively or in batch mode, switching between calculation or analysis and the storage of intermediate results. For these work steps, the LRZ AI Systems offer, among other things, an interactive web server and can be connected to the Data Science Storage (DSS) and LRZ Cloud Storage. This creates a flexible, scalable environment equipped to process a wide range of Big Data tasks.
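As a hedged sketch of this switch between computation and the storage of intermediate results, the following loop periodically writes checkpoints to a storage directory. The directory and checkpoint interval are hypothetical; in practice they would point to a DSS or Cloud Storage mount.

```python
# Sketch: iterative training with periodic checkpoints to mounted storage.
# CKPT_DIR is a hypothetical location; in practice it would be a DSS or Cloud Storage mount.
import os
import torch
import torch.nn as nn

CKPT_DIR = os.environ.get("CKPT_DIR", "./checkpoints")  # placeholder
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1, 101):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

    if step % 25 == 0:  # store intermediate results every 25 steps
        torch.save(
            {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            os.path.join(CKPT_DIR, f"ckpt_{step:05d}.pt"),
        )
```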
Researching and developing AI methods and models in Bavaria: As part of its high-tech agenda, the Free State of Bavaria first established around 130 AI professorships, followed by BayernKI. This high-performance infrastructure is housed at both the LRZ and the Erlangen National High-Performance Computing Centre (NHR@FAU) and connected by a fast data line. Here, researchers of all disciplines get fast and flexibly scalable access to AI technology.
The EuroHPC Joint Undertaking is funding an AI factory in Germany in which the LRZ is a partner: HammerHAI is expected to meet the demand for computing resources for AI methods in research, industry and the public sector. To this end, the High-Performance Computing Centre Stuttgart (HLRS) is working with the LRZ and other research institutes to create an AI-optimised supercomputing infrastructure and to provide support. The LRZ advises users and runs training courses on how to use AI.
The LRZ is closely linked to the Munich Centre for Machine Learning (MCML), not just through its Board of Directors but also by hosting AI technology for the MCML. This technology is structured in a similar way to the LRZ AI Systems and, if required, can quickly be connected whenever more computing power is needed for AI applications.
They should be trustworthy, as versatile as possible and useful for research: however, AI applications are still a long way from this ideal. The Trillion Parameter Consortium (TPC), founded by Argonne National Laboratory in the US, brings together around 60 research institutes from around the world to develop reliable, generative AI models for science.
Future exascale high-performance computers require a new kind of software that allows dynamic workloads to run with maximum energy efficiency on the best-suited hardware available in the system. The Technical University of Munich (TUM) and the Leibniz Supercomputing Centre (LRZ) are working together to create a production-ready software stack to enable low-energy, high-performance exascale systems.
Exascale supercomputers are knocking at the door. They might be a game-changer for the way we design and use high-performance computing (HPC) systems. As exascale performance drives more HPC systems toward heterogeneous architectures that mix traditional CPUs with accelerators like GPUs and FPGAs, computational scientists will have to design more dynamic applications and workloads in order to get massive performance increases for their applications.
The question is: How will applications leverage these different technologies efficiently and effectively? Power management and dynamic resource allocation will become the most important aspects of this new area of HPC. Stated more simply: how do HPC centres ensure that users are getting the most science per Joule?
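As a back-of-the-envelope illustration of “science per Joule”, the snippet below converts a job’s average power draw and runtime into energy-to-solution and relates it to the scientific output; all the numbers are invented for this example.

```python
# Illustration only: energy-to-solution for a hypothetical job.
avg_power_kw = 2500.0    # average system power draw during the job, in kW (invented)
runtime_hours = 12.0     # time to solution, in hours (invented)
simulated_ns = 500.0     # "science" produced, e.g. nanoseconds of MD trajectory (invented)

energy_kwh = avg_power_kw * runtime_hours          # kilowatt-hours consumed
energy_joules = energy_kwh * 3.6e6                 # 1 kWh = 3.6 MJ
science_per_joule = simulated_ns / energy_joules   # the quantity to maximise

print(f"{energy_kwh:.0f} kWh = {energy_joules:.3e} J, "
      f"{science_per_joule:.3e} ns of trajectory per Joule")
```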
Optimizing application performance on heterogeneous systems under power and energy constraints poses several challenges. Some are quite sophisticated, like the dynamic phase behaviour of applications. And some are basic hardware issues like the variability of processors: due to manufacturing limitations, low-power operation of CPUs can lead to a wide spread of frequencies across the cores. Adding to these is the ever-growing complexity and heterogeneity at the node level.
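To make this frequency variability tangible, the hedged sketch below reads the current per-core frequencies from the Linux cpufreq interface and reports the spread; the sysfs paths are standard on Linux, but their availability depends on the kernel and drivers of the node.

```python
# Sketch: report the spread of current core frequencies on a Linux node.
# Relies on the cpufreq sysfs interface; the paths may be absent on some systems.
import glob

freqs_khz = []
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
    with open(path) as f:
        freqs_khz.append(int(f.read().strip()))

if freqs_khz:
    lo, hi = min(freqs_khz), max(freqs_khz)
    print(f"{len(freqs_khz)} cores: {lo / 1e3:.0f}-{hi / 1e3:.0f} MHz "
          f"(spread {100 * (hi - lo) / hi:.1f} %)")
else:
    print("cpufreq interface not available on this node")
```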
A software stack for such heterogeneous exascale systems will have to meet some specific demands. It has to be dynamic, work with highly heterogeneous integrated systems, and adapt to existing hardware. TUM and LRZ are working closely together to build a software stack based on existing and proven solutions. Among others, MPI and its various implementations, SLURM, PMIx or DCDB are well-known parts of this Munich Software Stack.
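As a small, hedged illustration of one of these building blocks, the following mpi4py program (a Python binding to MPI) lets each rank report where it runs; on an LRZ-style cluster it would typically be launched through SLURM, for example with srun, though the exact launch command depends on the site configuration.

```python
# Hello-world with mpi4py: each MPI rank reports its rank, size and host.
# Example launch (site-dependent): srun -n 4 python mpi_hello.py
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"rank {rank} of {size} running on {socket.gethostname()}")

# A simple collective: sum the ranks across all processes.
total = comm.allreduce(rank, op=MPI.SUM)
if rank == 0:
    print(f"sum of ranks: {total}")
```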
“The basic stack is already running on the SuperMUC-NG supercomputer at the LRZ”, says Martin Schulz, Chair for Computer Architecture and Parallel Systems at the Technical University of Munich and Director at the Leibniz Supercomputing Centre. “Right now, we are engaged in two European research projects for further development of this stack on more heterogeneous, deeper integrated and dynamic systems, as they will become commonplace in the exascale era: REGALE and DEEP-SEA.” One of the foundations for the next generation of this software stack is the HPC PowerStack[1], an initiative, with TUM as one of the co-founders, for better standardization and homogenization of approaches for power and energy optimized systems.
REGALE aims to define an open architecture, build a prototype system, and incorporate in this system appropriate sophistication in order to equip supercomputing systems with the mechanisms and policies for effective resource utilization and execution of complex applications. DEEP-SEA will deliver the programming environment for future European exascale systems, capable of adapting at all levels of the software stack. While the basic technologies will be implemented and used in DEEP-SEA, the control chain will play a major role in REGALE.
Both projects are focused on making existing codes more dynamic so they can leverage existing accelerators: many codes today are static and might only be partially ready for more dynamics. This will require some refactoring, and in some cases complete rewrites of certain parts of the codes. But it will also require novel and elaborate scheduling methods that must be developed by HPC centres themselves. Part of the upcoming research in DEEP-SEA and REGALE will be to find ways to determine where targeted efforts on top of an existing software stack can yield the greatest result. To this end, agile development approaches will play a role: Continuous Integration with elaborate testing and automation is being established on BEAST (Bavarian Energy-, Architecture- and Software-Testbed) at the LRZ, the testbed for the Munich Software Stack.
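As a hedged sketch of the kind of automated test such a CI setup might run on a heterogeneous testbed, the pytest case below compares an accelerator result against a CPU reference and skips cleanly when no GPU is present; the kernel and tolerances are illustrative placeholders and not part of BEAST or the Munich Software Stack.

```python
# Sketch: CI-style correctness test comparing an accelerator result to a CPU reference.
# The kernel (a matrix multiply) and the tolerances are illustrative placeholders.
import pytest
import torch


def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Stand-in for a compute kernel under test."""
    return a @ b


@pytest.mark.skipif(not torch.cuda.is_available(), reason="no GPU on this test node")
def test_kernel_matches_cpu_reference():
    torch.manual_seed(0)
    a = torch.randn(256, 256)
    b = torch.randn(256, 256)

    expected = kernel(a, b)                    # CPU reference
    result = kernel(a.cuda(), b.cuda()).cpu()  # accelerator result

    assert torch.allclose(result, expected, rtol=1e-4, atol=1e-5)
```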
“Most research in the field of power and energy management today is done site-specific,” Schulz said. “We see little integration of the components; we have a lack of standardized interfaces that work on all layers of the software stack. In the end, this leads to suboptimal performance of the applications and increases the power needed by the system. With the Munich Software Stack, TUM and LRZ are working on an open, holistic, and scalable approach to an integrated power and energy management in order to get the most out of supercomputers to come.”
We research the latest computer and storage technologies as well as Internet tools. In collaboration with partners, we develop technologies for future computing, energy-efficient computing and IT security, as well as tools for data analysis and the development of artificial intelligence systems. Here is an overview of all our research projects.
The LRZ doesn’t just support researchers with AI technology and applications; we also advise and support you in collecting and processing data. Get in touch with the LRZ Big Data and AI team if you would like to know how to prepare, harmonise and ultimately process even the largest data sets at the LRZ yourself. You can also rest assured that we will provide you with the resources that best suit your AI projects.