2021-05-04-LRZ to expand its flagship supercomputing system to integrate HPC and AI


E. Mayer for LRZ

LRZ to expand its flagship supercomputing system to integrate HPC and AI

Garching/Munich, May 4, 2021 – Together with its partners Intel and Lenovo, the Leibniz Supercomputing Centre (LRZ) will expand its current flagship HPC system SuperMUC-NG, which is part of the Gauss Centre for Supercomputing (GCS). In addition to top performance in simulation and modelling, phase 2 of SuperMUC-NG will integrate and advance artificial intelligence (AI) methods of computation.

For this purpose, the system will be equipped with next-generation Intel Xeon Scalable processors (codenamed Sapphire Rapids) and “Ponte Vecchio”, Intel’s upcoming GPU based on the Xe-HPC microarchitecture for high performance computing and AI. The storage system will feature Distributed Asynchronous Object Storage (DAOS) and leverage 3rd Gen Intel Xeon Scalable processors and Intel Optane persistent memory to accelerate access to large amounts of data.

As with Phase 1, SuperMUC-NG Phase 2 will be jointly funded by the Free State of Bavaria and the Federal Ministry of Education and Research (BMBF) through GCS. The computing capacities are made available to specially qualified research projects nationwide in a scientific selection process.

New research tasks for supercomputing

“At the core of all LRZ activities is the user. It is our utmost priority to provide researchers with the resources and services they need to excel in their scientific domains,” says Prof. Dr. Dieter Kranzlmüller, Director of the LRZ. “Over the last years, we’ve observed our users accessing our systems not only for classical modeling and simulation, but increasingly for data analysis with artificial intelligence methods.” This requires not only computing power, but also a different computer architecture and configuration as well as more flexible data storage.

“We’re continuously pushing the boundaries of hardware and software technology to deliver an easy and scalable compute stack in the data center for a wide range of diverse and emerging workloads in HPC and AI,” said Raja Koduri, SVP, chief architect, and general manager of Architecture, Graphics, and Software at Intel. “We are thrilled that LRZ has chosen to partner with Intel in bringing their SuperMUC system to market based on Intel’s XPU product portfolio, advanced packaging and memory technologies, and the unified oneAPI software stack to power the next generation of high performance computing.”

Practical experience with artificial intelligence methods is increasingly becoming a key capability in science. This is attracting new user groups to LRZ: until now, it was mostly experts from physics, engineering and the natural sciences who relied on high-performance computing. With AI techniques becoming more widely used, demand for HPC and AI resources is now increasing in the fields of medicine, life and environmental sciences, as well as the humanities. For example, practitioners apply automated image, speech or pattern recognition to earth observation or climate data from satellites, anonymized medical imagery and health records, or demographic data. The more complex the neural networks behind these methods and the desired functions, the higher the demand for computing and fast memory performance.

SuperMUC-NG already offers enormous computing power, and this expansion will now upgrade it for more diverse tasks. Some of the new technology is currently being tested in the LRZ test environment BEAST (Bavarian Energy, Architecture, and Software Testbed) to better understand its capability in a future large-scale HPC system. To ensure that phase 2 of SuperMUC-NG continues to operate as energy-efficiently as possible, the 240 Intel compute nodes are integrated into Lenovo’s SD650-I v3 platform, which is directly cooled with warm water, and connected to the DAOS storage system via a high-speed network. The storage system has a capacity of 1 petabyte, but more importantly, this technology enables fast throughput of large data volumes. This system architecture is therefore particularly well suited for highly scalable, compute- and data-intensive workloads and artificial intelligence applications.

"The Leibniz Supercomputing Centre has been a thought leader in new technologies for many years, setting standards for research and development and being an important innovation partner for Lenovo. For example, LRZ has already installed warm water cooling and is planning to implement an integrated system for artificial intelligence and deep learning - all from Lenovo", emphasizes Noam Rosen, EMEA Director, HPC & AI, ISG at Lenovo. “Sustainability has also been important for LRZ in its infrastructure projects. That's why we are pleased to play a part in this initiative too, as the Lenovo components for SuperMUC-NG phase 2 will be manufactured in our new production facility in Hungary - rather than in our American or Asian production facilities - further improving the eco-footprint of our supply chain."

Consulting and training

While the DAOS storage system is expected to arrive in Garching in the fall of 2021, the compute system will follow in the spring of 2022. The LRZ is working with its user community in preparation: researchers already have access to GPU systems specialized for AI applications, and LRZ’s HPC and Big Data teams advise and support users in adapting and optimizing their codes and AI algorithms. The LRZ training program also offers a wide variety of machine learning and deep learning courses where students and researchers learn how to adapt existing algorithms or develop and train their own.