"Performing simulations faster saves us a lot of power"


The LRZ uses renewable energy. Photo: K. Würth/Unsplash

Supercomputing and artificial intelligence (AI) have come under critical scrutiny for the amount of energy these technologies consume: the supercomputer at the Leibniz Supercomputing Centre (LRZ) requires around 3,400 kilowatts of electricity at peak times, enough to power a small town. Systems specialised in AI are not as energy-hungry, but hundreds of training runs are often needed to develop intelligent systems for pattern recognition or natural language processing. Since 2006, the LRZ has been working to reduce the power consumption of its systems. Together with technology companies, it has developed and optimised a hot-water cooling system that has set standards worldwide. Accelerators such as graphics processing units (GPUs) or quantum computers are now expected to bring more efficiency to supercomputing, and software programming is also becoming a focus of necessary changes: "The optimisation of algorithms and applications is currently helping us to reduce the demand for energy," reports Prof. Dieter Kranzlmüller, head of the LRZ. "When it comes to software and applications, this demand can be further reduced through the use of AI and intelligent control." A conversation about saving energy and developing new solutions.

AI methods are becoming increasingly popular in science and academic research. Last year, the LRZ put a Cerebras CS-2 system into operation - how much energy does it need at high load?

Prof. Dieter Kranzlmüller: With its 850,000 computing cores, the Cerebras system is a supercomputer, but it specialises in AI methods, machine learning and applications of natural language processing (NLP). To train neural networks, up to 40 gigabytes of data can be stored on the chip and quickly exchanged between processors and nodes. That is a big advantage, but the system still consumes 15 kilowatts in standby mode and around 35 kilowatts at full load - about as much as three families use in a day. Compared to supercomputers, however, this is really not much. SuperMUC-NG consists of 6,480 computing nodes with 311,040 cores, but at full load it consumes around 3,400 kilowatts of electricity. On the other hand, the models and simulations calculated on it are much larger. Nevertheless, the energy requirements of AI are becoming an issue: on the one hand, AI is penetrating ever deeper into our daily lives; on the other, the models and neural networks used in research are growing rapidly, and with them their energy requirements. The University of Copenhagen estimates that the last training round of the ChatGPT text generator alone consumed almost 190,000 kilowatt hours of electricity. Now, we at the LRZ could claim that scientists use our supercomputers to calculate and model the scenarios we need to understand environmental problems and develop strategies against climate change. But since 2006, we have been working to reduce the energy consumption of our computer systems and to become a carbon-neutral data centre. The new Cerebras system is an exciting test and research area for us in this respect.

SuperMUC-NG Phase 2 will be equipped with Intel's Ponte Vecchio GPUs. How will they affect the energy consumption of the supercomputer?

Kranzlmüller: We don't have any performance data for the new system yet; it is still being built. In addition to the usual central processing units (CPUs), it will also use graphics processing units (GPUs), which are considered particularly suitable for AI methods and accelerate computing performance, but also require more energy. With its 240 computing nodes, Phase 2 is estimated to consume 500 kilowatts of electricity in normal operation. On the other hand, if we can run simulations faster with the more powerful and versatile GPUs, we can save a lot of energy. However, we will only be able to quantify this effect during operation - it will be exciting.
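To illustrate the trade-off Kranzlmüller describes, here is a minimal back-of-envelope sketch: energy to solution is roughly average power draw times runtime, so a GPU-accelerated run can draw more power and still consume less energy overall if it finishes sufficiently faster. All figures below are illustrative assumptions, not measured LRZ values.

```python
# Back-of-envelope comparison: energy = average power draw x runtime.
# All numbers are illustrative assumptions, not measured LRZ values.

def energy_kwh(power_kw: float, runtime_hours: float) -> float:
    """Energy consumed by a job, in kilowatt hours."""
    return power_kw * runtime_hours

# Hypothetical simulation: CPU-only run vs. GPU-accelerated run
cpu_run = energy_kwh(power_kw=300.0, runtime_hours=10.0)   # 3000 kWh
gpu_run = energy_kwh(power_kw=500.0, runtime_hours=4.0)    # 2000 kWh

print(f"CPU-only run: {cpu_run:.0f} kWh")
print(f"GPU run:      {gpu_run:.0f} kWh")
print(f"Saving:       {100 * (1 - gpu_run / cpu_run):.0f} %")
```

In this sketch the GPU run draws two thirds more power but finishes in less than half the time, so it ends up using a third less energy - the kind of effect that, as the interview notes, can only be verified with real operational data.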

You observe that the AI computing requirements of your users double about every four months. The neural networks, or training models, are constantly being extended with additional parameters and grow enormously. To build a useful model for a research question, hundreds of others are often discarded or adapted. This also increases the energy consumption of AI. Can something be done about this?

Kranzlmüller: If you're experimenting with AI methods and processing research data today, you probably don't care much about energy consumption - that was also the case with modelling and simulation in the past. You try out new methods, discard and optimise them, gain experience - but the methods keep getting better and more efficient. That's how progress is made. In the long run, redundancy in AI models could certainly be avoided; instead of training everything, some AI tasks could be programmed in ways that avoid ever larger models. But in this first phase of using AI, as a scientific computing centre we can limit the energy demand, especially by working on the cooling of the systems and our premises. Cerebras is cooled with water, which is already more efficient than air cooling. With the supercomputer, we save about 20 per cent of energy through hot-water cooling, and we also reduce our carbon footprint by cleverly designing the cooling circuits, using the waste heat and, hopefully soon, passing it on to neighbours to air-condition rooms. The LRZ also uses 100 per cent renewable energy and will soon be installing solar panels on the roof to generate its own electricity. More generally, the growing energy needs of computer systems raise questions about the rules of use: today, we allocate computing time to science and research, so from the user's point of view it doesn't matter whether the computers use a lot or a little electricity. In high-performance computing, there is already talk of allocating energy units instead, which could perhaps be an alternative for AI systems in the medium term.
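As a rough sketch of what such energy-based accounting could look like - assuming, purely for illustration, an average power draw per compute node, since the interview gives no concrete accounting figures - consumed node-hours could be converted into kilowatt hours and charged against a project's energy budget rather than a time budget:

```python
# Sketch of energy-based accounting instead of node-hour accounting.
# The average node power is an assumed figure for illustration only.

AVG_NODE_POWER_KW = 0.4  # assumed average draw per compute node, in kW

def energy_budget_used(node_hours: float) -> float:
    """Convert consumed node-hours into kilowatt hours charged to a project."""
    return node_hours * AVG_NODE_POWER_KW

# A project that ran 10,000 node-hours would be charged 4,000 kWh
# against its energy budget instead of 10,000 node-hours of compute time.
print(f"{energy_budget_used(10_000):.0f} kWh")
```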


The graphic shows the growth of neural networks for training AI applications.

How much does the programming of supercomputers or AI models influence energy consumption?

Kranzlmüller: This is currently the subject of intensive research in computer science, and two aspects are important here. On the one hand, we are looking at how software can be developed in an energy-efficient way, and on the other hand, how the performance of a programme can be optimised during execution. Indeed, there are programming languages such as Python or Perl that are more energy-intensive than, say, Rust or C. And if I can improve the performance of a programme so that it takes half as long to execute its functions, this reduces energy consumption: even if the application draws slightly more power, it does so for a much shorter time. In supercomputing, we also work with scientists to tweak algorithms so that they run more smoothly, faster and more efficiently - a task that can take weeks or even months.
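A minimal Python example of the kind of optimisation alluded to here - assuming NumPy is available - replaces an interpreted loop with a vectorised call. The same result is computed in a fraction of the time on the same hardware, and since energy is roughly average power times runtime, the shorter run generally costs less energy:

```python
# Minimal illustration: the same computation written two ways.
# Shorter runtime on the same hardware generally means less energy,
# since energy is roughly average power draw x time.
import time
import numpy as np

data = np.random.rand(10_000_000)

# Interpreted Python loop
start = time.perf_counter()
total = 0.0
for x in data:
    total += x * x
loop_time = time.perf_counter() - start

# Vectorised equivalent (runs in optimised native code)
start = time.perf_counter()
total_vec = np.dot(data, data)
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f} s, vectorised: {vec_time:.3f} s")
```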

Reducing the complexity of applications is seen as a strategy for energy efficiency in AI. Could this also help to reduce energy requirements in HPC?

Kranzlmüller: No, in HPC we are now relying on GPUs and other accelerators, which provide higher performance for some functions and run faster, so they use less energy. In the medium term, hopes are also pinned on the integration of quantum computers, which will hopefully accelerate computations further. At the moment, the optimisation of algorithms and applications is helping. In SuperMUC-NG, several million sensors collect a wide range of operational data that we can use to improve applications and reduce run times. The vision is to use this operational data to change the way software is developed in general and to adapt programmes better to the technology.

Are there any other measures to increase the energy efficiency of the supercomputers and AI systems at the LRZ?

Kranzlmüller: Our approach is holistic; at the LRZ we work and research in four areas to achieve greater energy efficiency. In the buildings, we ensure energy-efficient cooling of the systems, for example through hot-water cooling and the use of waste heat. By dynamically adjusting clock frequencies and using accelerators, we can reduce the energy consumption of the hardware by around 30 per cent. Intelligent control and virtualisation of hardware also help to save power, as does the optimisation of operating systems, libraries and software. As a result of all these measures, SuperMUC-NG requires 2,500 kilowatts of energy under full load instead of the 3,400 mentioned above. On the hardware side, we have pretty much exhausted the possibilities, but on the software and application side we may be able to reduce energy consumption even further through the use of AI and intelligent control. For this, we need the operating data, and the LRZ has developed its own tools for monitoring as well as initial approaches to intelligent evaluation.
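As a minimal sketch of how such monitoring data could feed into intelligent control - with entirely hypothetical frequencies, power draws and runtimes, since the interview names no concrete figures - one could pick, per application, the clock frequency with the lowest energy to solution:

```python
# Sketch: pick the clock frequency with the lowest energy-to-solution
# for a given application, based on hypothetical measured runs.
# Frequencies, power draws and runtimes below are illustrative only.

measurements = {
    # frequency_ghz: (avg_node_power_kw, runtime_hours)
    2.7: (0.45, 1.00),
    2.3: (0.36, 1.12),
    1.9: (0.29, 1.35),
}

def energy_kwh(power_kw: float, hours: float) -> float:
    """Energy to solution for one run, in kilowatt hours."""
    return power_kw * hours

best_freq = min(measurements, key=lambda f: energy_kwh(*measurements[f]))
for f, (p, t) in sorted(measurements.items()):
    print(f"{f} GHz: {energy_kwh(p, t):.3f} kWh")
print(f"Energy-optimal frequency: {best_freq} GHz")
```

In this toy data set the lowest frequency wins because the longer runtime is more than offset by the reduced power draw; on a real system the optimum depends on the application and must be derived from the collected operational data.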

100 per cent renewable energy: Does this cover the entire energy requirement of the LRZ, including your supercomputers?

Kranzlmüller: For us, the electricity comes out of the socket, but we require the supplier to provide us with electricity exclusively from renewable sources - and to prove this with certificates.

The demand for renewable energy is growing: could supply bottlenecks slow down supercomputing or AI?

Kranzlmüller: It is clear that the amount of renewable energy is still limited. If many people use it, there could indeed be bottlenecks. Then the question arises as to whether the applications of all data centres are equally important: how important is a medical simulation, and how important is Facebook? But we also need HPC and AI to improve the generation of renewable energy and to optimise its use in HPC systems. (vs)


Prof. Dieter Kranzlmüller, head of the LRZ