Award-Winning Supercomputing Research

The more than 6,480 compute nodes of SuperMUC-NG contain around 15 million sensors that collect a wide range of data from the system. "In preparation for exascale times, high-performance computing systems are becoming increasingly complex," explains Alessio Netti, computer scientist at the Leibniz Supercomputing Centre (LRZ) in Garching. "For these systems to run stably, become more controllable and, above all, consume significantly less energy, we need more knowledge and thus more data." At the end of June, two projects dealing with the operating data of high-performance computers received awards: the jury of the ACM HPDC 2020 conference in Stockholm honoured the LRZ tool Wintermute as one of the most innovative analysis methods for High Performance Computing (HPC), and at ISC 2020 in Frankfurt, a research team led by Amir Raoofy of the Technical University of Munich (TUM) won the Hans Meuer Award for their work on 'Time Series Mining at Petascale Performance'.

Collecting and Evaluating the Right Data

Sensors already provide all kinds of information from supercomputers, for example on the temperature, power draw, load and stress of components. The open-source software Data Center Data Base (DCDB), which collects data from millions of sensors and thus enables the control of SuperMUC-NG and CoolMUC-3, was developed at the LRZ. To monitor and operate these systems efficiently, however, an analysis tool is needed, and above all a systematic approach to evaluating these data. With Wintermute, Netti presented a generic model for this at the digital edition of the HPDC conference, and thus a basis for Operational Data Analytics (ODA). It is intended to provide as comprehensive a picture of supercomputers as possible and to enable forecasts and adjustments to the technology. To this end, Wintermute processes information generated within components (in-band data) or sent by them (out-of-band data), either continuously as a stream (online processing) or only when explicitly required (on-demand processing).
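The difference between online and on-demand processing can be illustrated with a short sketch. The Python snippet below is purely illustrative and does not use the actual DCDB or Wintermute APIs; read_sensor() is a hypothetical stand-in for a sensor query. A cheap running statistic is updated on every sample (online), while a heavier summary is computed only when explicitly requested (on-demand):

```python
import random
import time
from collections import deque

# Illustrative only: read_sensor() is a stand-in for a DCDB-style sensor
# query, not the real API. Each reading is (timestamp, sensor_id, value).
def read_sensor():
    return (time.time(), "node42.cpu_temp", 40.0 + random.random() * 20.0)

class OnlineAnalyzer:
    """Online processing: every incoming reading updates a running statistic."""
    def __init__(self, window=60):
        self.window = deque(maxlen=window)

    def push(self, reading):
        self.window.append(reading[2])
        return sum(self.window) / len(self.window)  # continuously updated mean

def on_demand_report(history):
    """On-demand processing: a heavier analysis, run only when requested."""
    values = [v for _, _, v in history]
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}

history, analyzer = [], OnlineAnalyzer()
for _ in range(120):                      # simulate two minutes of samples
    reading = read_sensor()
    history.append(reading)
    running_mean = analyzer.push(reading)  # cheap, happens on every sample

print(on_demand_report(history))           # expensive, triggered explicitly
```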

In three case studies carried out on CoolMUC-3, the LRZ computer scientist shows which monitoring data can be used to detect anomalies in individual compute nodes, for example, so that they can be replaced or optimised. Energy consumption can also be tracked and adjusted using Wintermute and selected data. The open-source tool also shows where the technology causes bottlenecks in simulation and modelling. "Wintermute uses machine learning methods to make Operational Data Analytics more meaningful and thus more powerful," says Netti. "The tool was designed to be integrated into any existing monitoring system." The name itself hints at this ambition: Wintermute is an artificial intelligence in William Gibson's science fiction trilogy that merges with another AI to become a superior digital life form. The findings from Wintermute can help to improve the computer systems of the future.
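To give a flavour of how monitoring data can expose a misbehaving node, the following sketch flags nodes whose average power draw deviates strongly from the rest of the fleet using a simple z-score rule. This is a deliberately minimal example on synthetic data, not Wintermute's actual machine-learning models:

```python
import numpy as np

# Minimal sketch: flag compute nodes whose mean power draw deviates
# strongly from the fleet. Data is synthetic; real ODA models differ.
rng = np.random.default_rng(0)
power = rng.normal(300.0, 10.0, size=(64, 1000))  # 64 nodes x 1000 samples (W)
power[17] += 40.0                                  # inject a misbehaving node

node_means = power.mean(axis=1)
z = (node_means - node_means.mean()) / node_means.std()

anomalous = np.where(np.abs(z) > 3.0)[0]           # simple z-score rule
print("suspect nodes:", anomalous)                 # -> [17]
```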

A Matrix for Sorting Data

Amir Raoofy, PhD candidate at Professor Martin Schulz's Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (TUM), also works with data supplied over weeks or even years by thousands of sensors in supercomputers or in the monitoring systems of power plants. He is interested, however, in how SuperMUC-NG and CoolMUC-3 cope with these huge amounts of data. "Using matrix profile algorithms, time series can be searched for patterns and similarities," says Raoofy, outlining the problem. "But they are difficult to scale and are not suitable for HPC systems." Yet the evaluation of large time series requires supercomputing: anyone who wants to know under which conditions a gas turbine runs reliably, and when its first components will need repair, has to examine a great deal of data. Only the computing power of supercomputers, combined with scalable algorithms, makes such analyses possible.
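A matrix profile records, for every window of a time series, the distance to its most similar other window, which makes recurring motifs and anomalies easy to spot. The naive computation below is a rough sketch using plain Euclidean distance rather than the z-normalised distance typically used; it compares every window against all others and therefore scales quadratically with the series length, which illustrates why petascale time series need a scalable algorithm like the one the TUM team developed:

```python
import numpy as np

# Naive matrix-profile sketch, O(n^2 * m). Scalable approaches such as the
# TUM team's (MP)N are organised very differently to run on many cores.
def matrix_profile(ts, m):
    """For each window of length m, distance to its nearest non-trivial match."""
    n = len(ts) - m + 1
    windows = np.array([ts[i:i + m] for i in range(n)])
    profile = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(windows - windows[i], axis=1)
        d[max(0, i - m // 2):i + m // 2 + 1] = np.inf  # exclude trivial matches
        profile[i] = d.min()
    return profile

ts = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.05 * np.random.randn(2000)
mp = matrix_profile(ts, m=100)
print("most repeated motif starts near index", mp.argmin())
```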

Raoofy and his colleagues developed the now award-winning scalable approach (MP)N. It runs efficiently on up to 256,000 compute cores, around 86 percent of the computing resources of SuperMUC-NG. That it delivers exact results was verified with performance data from SuperMUC-NG. The algorithm is currently being used to analyse data supplied by two gas turbines belonging to Stadtwerke München, Munich's municipal utility, within TurbO, a project funded by the Bavarian Research Foundation. "In our experiments, we performed the fastest and largest multidimensional matrix profile computation ever," reports Raoofy. "We achieved 1.3 petaflops." This means that supercomputers like SuperMUC-NG can evaluate data from long time series quickly and efficiently, a capability that science and industry will know how to use. (vs)