Supercomputers are extremely energy-hungry, and high-performance computing (HPC) centres such as the RIKEN Center for Computational Science (R-CCS) and the Leibniz Supercomputing Centre (LRZ) rely on direct water cooling to lower the energy costs of their systems. Compared to air cooling, this method is significantly more economical, and the waste heat from the systems can be used to heat nearby offices and buildings. Now, the two computing centres are pooling their expertise to ensure the energy- and resource-efficient operation of their next-generation supercomputers. During SC Asia in Osaka, Prof. Satoshi Matsuoka, Director of the RIKEN Center for Computational Science, and Prof. Dieter Kranzlmüller, Chairman of the LRZ's Board of Directors, signed a memorandum of understanding (MoU) to support these goals.
The collaboration aims to facilitate the exchange of knowledge on direct hot-water cooling, thermal optimisation, heat recovery, and energy-aware scheduling of computing tasks. To this end, the two scientific computing centres intend to conduct a comparative analysis of computer technology and jointly develop tools for monitoring and managing tasks.
Both centres are planning the successors to their current flagship supercomputers: SuperMUC-NG at LRZ and Fugaku at RIKEN. While the Bavarian Blue Lion and the Japanese FugakuNext concepts differ, both systems will integrate accelerators, including those from NVIDIA. These accelerators will enable more advanced artificial intelligence (AI) methods, but will also significantly increase power consumption. RIKEN currently uses cooling water at around 15 degrees Celsius. LRZ, by contrast, operates at temperatures of up to 40 degrees Celsius and has increased the energy efficiency of its high-performance resources through targeted monitoring. This monitoring allows hardware and computing jobs to be controlled in an energy-conscious manner and scientific codes to be optimised for smooth execution. Adjustments to the building technology support water cooling and the economical operation of HPC systems.
Through their MoU, both institutes are increasing cooperation on a variety of topics:
The exchange of data, findings and metrics on the use and operation of hot-water-cooled systems, particularly at high utilisation rates.
Cooperation in assessing risks and developing benchmarks for optimal system temperature.
Research into the thermal behaviour of various hardware components, such as temperature-sensitive components for high-bandwidth storage devices.
Investigations into the so-called thermal 'sweet spot', i.e. the optimal temperature for achieving high energy efficiency, high computing power and high availability of computing resources.
Evaluation of implementation strategies for the energy-efficient planning of computing and simulation tasks, and the development of the corresponding control tools.
Feasibility studies on heat recovery and the construction of district heating systems in the vicinity of data centres.
Evaluation of use cases and development of scenarios for the integration of energy-conscious usage rules.
The collaboration will initially run until 2030. There are also plans to share technical resources and bring support and HPC operations staff together in workshops or through exchange visits. Discussions will be held to further expand the cooperation and strengthen the partnership between the two centres.