DCDB: Modular, Continuous, and Holistic Monitoring for HPC

Flagship high-performance computing (HPC) systems consume up to 3 MW of power. Among them are the SuperMUC at the Leibniz Supercomputing Centre (LRZ) in Garching near Munich and its successor, SuperMUC-NG, which is almost an order of magnitude faster.

"At the moment, we can do very little to decrease the power consumption of the machines, perhaps only a few percentage points," said Michael Ott. He deals with the supercomputer efficiency, measuring and recording consumption data, and leads the data analysis team in the Energy-Efficient High-Performance Computing Working Group (EE HPC WG)."Working on the computer infrastructure, such as the cooling system, or optimizing programs and applications promise to help increase efficiency. In supercomputing, we need to look at all aspects to increase energy efficiency,” he said.

New measurement tool, better interfaces

LRZ is involved in several projects exploring how to run supercomputers more energy efficiently. For example, LRZ's flagship supercomputer, SuperMUC-NG, is cooled with an innovative warm-water cooling solution. Ott and his team have now presented their monitoring program, the Datacenter Data Base (DCDB). In addition to data from the hardware components of the system and sensors in the immediate vicinity of the computer, the program also records metrics from the operating system and from applications at runtime. Such data, in turn, points to adjustments that can be made to optimize the supercomputers and their energy consumption.

Optimize an application before it runs

In addition to data on the building infrastructure, such as water and air temperatures and power consumption, the open-source software DCDB collects information directly from components of SuperMUC-NG such as processors, network cards, and storage systems, as well as from operating systems, libraries, and programs or applications. "If we know which components the applications use and how, we can begin to optimize the execution of these programs and thus increase the efficiency of the computer," Ott explains. "The holy grail in optimizing the operation of a supercomputer would be to know the properties of an application before it is used."
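To make the scope of such measurements concrete, the minimal sketch below samples two of the many signals a holistic monitor can combine on a Linux node: CPU package power, read from the standard RAPL powercap counters in sysfs, and the load average reported by the operating system. It is an illustration of the idea only, not DCDB's actual interface, and the function names are our own.

```python
# Illustrative sketch only, not DCDB's real API. It samples two of the signal
# types a holistic monitor combines: a hardware counter (Intel RAPL package
# energy) and an operating-system metric (/proc/loadavg).

import time
from pathlib import Path

# Cumulative CPU package energy in microjoules (requires RAPL support).
RAPL_ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def read_energy_uj() -> int:
    """Read the cumulative CPU package energy counter."""
    return int(RAPL_ENERGY.read_text())

def read_loadavg() -> float:
    """Read the 1-minute load average from the operating system."""
    return float(Path("/proc/loadavg").read_text().split()[0])

def sample(interval_s: float = 1.0) -> dict:
    """Take one timestamped sample of hardware and OS metrics."""
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    return {
        "timestamp": time.time(),
        "cpu_power_w": (e1 - e0) / 1e6 / interval_s,  # microjoules -> watts
        "loadavg_1min": read_loadavg(),
    }

if __name__ == "__main__":
    print(sample())
```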

Better management, lower consumption

The challenge is to better coordinate the performance of supercomputers and reduce the energy consumption of applications without slowing them down. The problem is that the machines run many different programs and codes simultaneously, most of which are not standardized and were programmed by scientists for their own specific research goals. At LRZ, researchers each have up to 48 hours of computing time to analyse huge amounts of data or create simulations and models from their datasets. "Scientists want to use computing time to solve their research questions," Ott says. "Optimizing the runtime or reducing energy consumption is of secondary importance to them. But if data centres like LRZ understand which codes and programs work efficiently and which don't, they could help their users improve applications. They could also use this knowledge to develop and build even more efficient components for their super machines."

Preparation for Exascale

Integration also plays a role in DCDB: the monitoring program connects various data silos and harmonizes measured values that were previously collected from disparate sources. "The modular structure was important to us because it creates flexibility and allows us to connect other databases or software tools without much effort," says Ott.
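As a hedged illustration of what such a modular structure can look like, the sketch below defines a tiny plugin interface in Python: each data source is one small class, and a collector merges all readings into a single harmonized stream of (sensor, timestamp, value) records. The class and sensor names are hypothetical, not DCDB's real plugin API.

```python
# Hypothetical sketch of a modular, plugin-based collector. Each source
# implements one small class; the collector harmonizes everything into a
# common (sensor_name, timestamp, value) form.

import time
from abc import ABC, abstractmethod

class SensorPlugin(ABC):
    """One pluggable data source (e.g. facility sensors, IPMI, /proc)."""
    @abstractmethod
    def read(self) -> dict[str, float]:
        """Return {sensor_name: value} in harmonized units."""

class FacilityPlugin(SensorPlugin):
    def read(self) -> dict[str, float]:
        # Constant stands in for a real building-management-system query.
        return {"facility.inlet_water_temp_c": 42.5}

class ProcPlugin(SensorPlugin):
    def read(self) -> dict[str, float]:
        with open("/proc/loadavg") as f:
            return {"os.loadavg_1min": float(f.read().split()[0])}

def collect(plugins: list[SensorPlugin]) -> list[tuple[str, float, float]]:
    """Pull all plugins into one unified (name, timestamp, value) stream."""
    now = time.time()
    return [(name, now, value)
            for p in plugins
            for name, value in p.read().items()]

print(collect([FacilityPlugin(), ProcPlugin()]))
```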

The program has been made available to computing centres and researchers as open-source software. Ott noted that visualization capabilities are still lacking, but DCDB is already collecting information from supercomputers like SuperMUC-NG, and the idea is to deploy the tool on more machines and gather experience with it.

These data and the interfaces for optimizing energy management systems also lay the foundations for the next generation of supercomputers, the so-called exascale generation. They help improve the procurement of technical components, IT systems, and programs by informing new requirements and criteria.

Save energy with artificial intelligence

At LRZ the team also experiments with artificial intelligence: DCDB is not only installed on the new SuperMUC-NG, but has also been running for some time on CoolMUC-3, a Linux cluster. The computer scientists Alessio Netti and Daniele Tafani are experimenting with the first DCDB data sets produced on CoolMUC-3. They are investigating whether it is possible to forecast how much power computers and applications consume in individual work steps. "If we analyse the behaviour of applications on computers, we can use artificial intelligence and machine learning to intervene and optimize energy consumption," Netti said.
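The sketch below shows, under stated assumptions, what such an experiment can look like: a regression model trained on monitoring features to forecast node power draw. Synthetic data stands in for real DCDB measurements, and the feature set (CPU utilization, memory bandwidth, instructions per cycle, network traffic) is illustrative rather than the one used at LRZ.

```python
# Hedged sketch of a power-forecasting experiment: train a regressor on
# monitoring features to predict node power draw. All data here is synthetic.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Features a monitor might provide per sample: CPU utilization, memory
# bandwidth, instructions per cycle, network traffic (all normalized).
X = rng.uniform(size=(n, 4))
# Synthetic "true" power model: base load plus utilization-dominated draw.
y = 120 + 180 * X[:, 0] + 40 * X[:, 1] + 15 * X[:, 2] + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(f"Mean absolute error: {mean_absolute_error(y_test, pred):.1f} W")
```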

Early returns are encouraging: the DCDB data shows not only the energy requirements of processors for storage and computation, but also where and when applications consume particularly large amounts of power. With this knowledge, the first software tools are being developed that analyse the DCDB data and prepare it for smart decisions. "There is a cycle of smart systems with which we can build machine learning for energy efficiency," noted Tafani. "It is quite possible to intervene in the performance of a supercomputer and coordinate individual work steps in a new or different way."
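One possible form of such an intervention is sketched below under assumptions of our own: the per-node power budget and the crude throttle-to-minimum policy are hypothetical, but the cpufreq files used are the standard Linux interface for limiting CPU frequency (writing scaling_max_freq requires root privileges).

```python
# Illustrative sketch of acting on a power forecast: if predicted draw
# exceeds an assumed node budget, clamp the allowed CPU frequency via the
# standard Linux cpufreq sysfs interface. The policy here is deliberately
# crude; a real system would step frequencies down gradually.

from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")
NODE_POWER_BUDGET_W = 350.0  # assumed per-node budget, not an LRZ value

def throttle_if_needed(predicted_power_w: float) -> None:
    """Clamp the maximum CPU frequency when the forecast exceeds the budget."""
    if predicted_power_w <= NODE_POWER_BUDGET_W:
        return
    min_freq = (CPUFREQ / "cpuinfo_min_freq").read_text().strip()
    (CPUFREQ / "scaling_max_freq").write_text(min_freq)  # needs root
    print(f"Forecast {predicted_power_w:.0f} W over budget; capped CPU frequency.")

throttle_if_needed(predicted_power_w=380.0)
```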

Knowledge about the work of supercomputers

It won't be long before artificial intelligence and machine learning actually control a computer's power requirements. Although the data on the machines, applications, and work steps is far from comprehensive, DCDB is busy collecting more. "The data used to estimate the computing time of applications and plan the job mix for the machine are still unreliable," Ott noted. But the good news remains: the next generation of supercomputers can be set up and controlled with the help of such data so that they consume significantly less energy. By then, the behaviour of applications will also be better understood, bringing computer science a step closer to the holy grail of energy efficiency.