Award-Winning Supercomputing Research

The more than 6,480 compute nodes of SuperMUC-NG contain around 15 million sensors that collect a wide range of data from the system. "In preparation for exascale times, high-performance computing systems are becoming increasingly complex," explains Alessio Netti, computer scientist at the Leibniz Supercomputing Centre (LRZ) in Garching. "For these systems to run stably, become more controllable and, above all, consume significantly less energy, we need more knowledge and thus more data." At the end of June, two projects dealing with the operating data of high-performance computers received awards: the jury of the ACM HPDC 2020 conference, hosted from Stockholm, honoured the LRZ tool Wintermute as one of the most innovative analysis methods for High Performance Computing (HPC), and at ISC 2020 in Frankfurt, a research team led by Amir Raoofy from the Technical University of Munich (TUM) won the Hans Meuer Award for its work on 'Time Series Mining at Petascale Performance'.

Collecting and Evaluating the Right Data

Sensors already provide all kinds of information from supercomputers, for example on the temperature, power draw, load and stress of components. The LRZ has already developed the open-source software Data Center Data Base (DCDB), which collects data from millions of sensors and thus supports the operation of SuperMUC-NG and CoolMUC-3. To monitor and operate these systems efficiently, however, an analysis tool is needed, and above all a systematic approach for evaluating these data. With Wintermute, presented at the digital edition of the HPDC conference, Netti introduced such a generic model and thus a basis for Operational Data Analytics (ODA). It is intended to provide as comprehensive a picture of supercomputers as possible and to enable forecasts and adjustments to the technology. To this end, Wintermute processes information generated within components (in-band data) or sent by them (out-of-band data), either continuously as a stream (online processing) or only when explicitly required (on-demand processing).
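To make the two processing modes concrete, here is a minimal sketch in Python. All names in it (SensorStream, push, query, the watchdog threshold) are hypothetical illustrations, not Wintermute's actual plugin API:

```python
# Minimal sketch of the two ODA processing modes described above.
# All names here are hypothetical illustrations, not Wintermute's API.
from collections import deque
from statistics import mean
from typing import Callable

class SensorStream:
    """Buffers readings from one sensor (in-band or out-of-band alike)."""
    def __init__(self, name: str, maxlen: int = 10_000):
        self.name = name
        self.buffer: deque[float] = deque(maxlen=maxlen)

    def push(self, value: float, online_ops: list[Callable[[float], None]]):
        # Online processing: every new reading flows through the
        # continuously running operators as it arrives (streaming).
        self.buffer.append(value)
        for op in online_ops:
            op(value)

    def query(self, on_demand_op: Callable[[list[float]], float]) -> float:
        # On-demand processing: an analysis runs only when explicitly
        # requested, over whatever history the stream has buffered.
        return on_demand_op(list(self.buffer))

# Usage: stream power readings, alert online, aggregate on demand.
stream = SensorStream("node042.power_watts")
alerts = []
watchdog = lambda w: alerts.append(w) if w > 400.0 else None
for reading in (320.5, 355.0, 410.2, 390.8):
    stream.push(reading, online_ops=[watchdog])
print("online alerts:", alerts)                # [410.2]
print("on-demand mean:", stream.query(mean))   # 369.125
```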

Using three case studies carried out on CoolMUC-3, the LRZ computer scientist shows which monitoring data can be used, for example, to detect anomalies in individual compute nodes so that they can be replaced or optimised. Energy consumption can also be tracked and adjusted using Wintermute and selected data. The open-source tool also shows where the technology is causing bottlenecks in simulation and modelling. "Wintermute uses machine learning methods to make Operational Data Analytics more meaningful and thus more powerful," says Netti. "The tool was designed to be integrated into any existing monitoring system." The name, incidentally, is a literary reference: Wintermute is an artificial intelligence in a science fiction trilogy by William Gibson that merges with another AI to become a better digital life form. The findings from Wintermute can help to improve the computer systems of the future.
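As a hedged illustration of what such machine-learning-based anomaly detection on node monitoring data can look like, the following sketch flags suspicious node snapshots with scikit-learn's IsolationForest; the choice of model, features and thresholds are assumptions, since the source does not specify Wintermute's internals:

```python
# Generic sketch of ML-based anomaly detection on per-node sensor data.
# IsolationForest is an illustrative choice; the source text does not
# specify which models Wintermute's case studies actually used.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Rows: one snapshot per node; columns: temperature (C), power (W).
healthy = rng.normal(loc=[55.0, 350.0], scale=[3.0, 20.0], size=(500, 2))
model = IsolationForest(contamination=0.01, random_state=0).fit(healthy)

snapshots = np.array([
    [56.2, 348.0],   # nominal node
    [81.5, 510.0],   # overheating, power-hungry node
])
flags = model.predict(snapshots)  # +1 = normal, -1 = anomaly
for node, flag in zip(("node001", "node002"), flags):
    print(node, "anomalous" if flag == -1 else "normal")
```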

A Scalable Approach for Analysing Huge Time Series

Amir Raoofy, PhD candidate at the Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (TUM), headed by Professor Martin Schulz, also works with data supplied by thousands of sensors in supercomputers or in monitoring systems installed at power plants, data generated over weeks or even years. Raoofy is interested in how SuperMUC-NG and CoolMUC-3 handle these huge amounts of data. "Using matrix profile algorithms, time series can be searched for patterns and similarities," says Raoofy, outlining the problem. "But they are difficult to scale and are not suitable for HPC systems." Yet the evaluation of large time series requires supercomputing power: anyone who wants to know under which conditions a gas turbine will run reliably, and when its first components will need repair, must be able to sift through vast amounts of data. The compute power provided by supercomputers makes such analyses possible only in combination with appropriately scalable algorithms.
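For readers unfamiliar with the technique, the sketch below computes a matrix profile naively in Python. It illustrates the concept only and is not the award-winning (MP)^N implementation; it also uses plain Euclidean distance where production matrix-profile codes typically z-normalise the windows:

```python
# Naive matrix profile sketch (concept illustration only; not the
# award-winning (MP)^N implementation). For each window of length m it
# finds the distance to the most similar non-overlapping window in the
# series: low values mark repeated patterns (motifs), high values mark
# anomalies (discords). Complexity is O(n^2 * m); distributing this
# quadratic work is exactly what scalable approaches must solve.
import numpy as np

def naive_matrix_profile(ts: np.ndarray, m: int) -> np.ndarray:
    n = len(ts) - m + 1                      # number of windows
    windows = np.lib.stride_tricks.sliding_window_view(ts, m)
    profile = np.full(n, np.inf)
    excl = m // 2                            # exclusion zone half-width
    for i in range(n):
        dists = np.linalg.norm(windows - windows[i], axis=1)
        lo, hi = max(0, i - excl), min(n, i + excl + 1)
        dists[lo:hi] = np.inf                # ignore trivial self-matches
        profile[i] = dists.min()
    return profile

# Example: a pattern planted twice in noise yields low profile values.
rng = np.random.default_rng(0)
ts = rng.normal(size=1000)
ts[100:150] += np.sin(np.linspace(0, 4 * np.pi, 50))  # motif occurrence 1
ts[700:750] += np.sin(np.linspace(0, 4 * np.pi, 50))  # motif occurrence 2
mp = naive_matrix_profile(ts, m=50)
print("most similar pair starts near index", mp.argmin())
```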

Raoofy and his colleagues have developed the now award-winning scalable approach (MP)^N. It currently runs efficiently on up to 256,000 compute cores, around 86 percent of the computing resources of LRZ's SuperMUC-NG system. That it delivers exact results was verified with performance data from SuperMUC-NG. The algorithm is currently being used within the TurbO research project to analyse data supplied by two gas turbines belonging to Stadtwerke München. "In our experiments, we performed the fastest and largest multidimensional matrix profile computation ever," reports Raoofy. "We achieved 1.3 petaflops." This means that supercomputers like SuperMUC-NG can quickly and efficiently evaluate data from long-term measurements, a capability that science and technology will know how to use. (vs)

The Hans Meuer Award-winning paper 'Time Series Mining at Petascale Performance' is available for free download.