Reliable LRZ network

To guarantee the reliability of its internet and mail services, the LRZ put the fail-safety of its routers to the test and pulled the plug on one of them.

Duplicate components: For safety reasons, the LRZ's two in-house routers are housed in different racks and different rooms. Photo: A. Podo/LRZ

Remote access to computers, programmes and data storage runs over the network, as does communication between researchers and institutes or between students and lecturers, along with research and more than 1500 hosted websites: "The network must always work, the internet must always be available for our users, even if there is a fire in a computer room or the power fails," says Helmut Reiser. The professor of computer science is deputy director of the Leibniz Supercomputing Centre (LRZ) and in this role is responsible for the reliability of its technology and services. On 19 September 2023, at 8:10 in the morning, Reiser therefore pulled the plug on one of the two in-house routers that the LRZ uses to connect to the internet, the Münchner Wissenschaftsnetz (MWN) and the rest of the world.

Redundancy test of the routers

For safety reasons, these two devices are not only installed in different racks, but also placed in different parts of the computer cube: "They are also designed in such a way that in case of failure, one replaces the other within seconds," Reiser explains. "To this end, they are also located in different parts of the building and fire compartments." If cables smoulder on one floor, computer equipment catches fire or the power fails in parts of the data centre, the unaffected router immediately takes over the work of the other. Pulling the plug was meant to prove at last whether this redundancy concept works and the LRZ's online services remain reliably accessible. "As an IT specialist in a data centre, you usually assemble complex technology for services, but you rarely, if ever, test these concepts and constructs in productive operation," says Reiser, explaining the exercise. "The principle of hope prevails - at the LRZ as well as in data centres in general. People believe in fail-safety without really having checked it." Such checks are therefore part of the package of measures for the LRZ's service certifications.

Very few users noticed the consequences of this first redundancy test of the routers. There were brief delays of about 10 seconds: online connections stalled and mail servers had to re-establish some connections. But most online and web services ran without problems, and the switch to a single router was hardly noticeable: "Nowhere did the internet connection or mails drop - the test was successful," says Reiser, and his account conveys relief as well as satisfaction. Concerns about the test were high at the LRZ, partly because technology tests always carry a residual risk of malfunctions. For one and a half years, the test had been postponed again and again: "Some colleagues feared that the older routers, which we will soon replace, would no longer start up properly or only partially after the test," adds Reiser. "But of course these should not be reasons not to run any tests at all."
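How long such an interruption lasts can be worked out from a simple log of reachability probes. The following is only a minimal sketch in Python: the function name `longest_outage` and the sample values are hypothetical and are not LRZ measurements or LRZ tooling. It assumes a list of (timestamp in seconds, reachable) pairs recorded by some external probe.

```python
from typing import List, Optional, Tuple


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    An outage runs from the first failed probe to the next successful one.
    `samples` must be sorted by timestamp.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            # First failed probe: the outage begins here.
            outage_start = timestamp
        elif reachable and outage_start is not None:
            # First successful probe after a failure: the outage ends here.
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    # If the log ends while still unreachable, count up to the last probe.
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical probe log, invented for illustration only.
example_log = [(0.0, True), (1.0, True), (2.0, False),
               (5.0, False), (12.0, True), (13.0, True)]
print(longest_outage(example_log))
```

With probes this coarse, the result is only as precise as the probe interval; a real measurement would sample far more often.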

Errors detected in the system

All the more so because experience shows that such checks reveal more than planned: in this case, the LRZ specialists came across problems with two of the four security systems that work together with the routers and, also redundantly connected, analyse internet traffic, and they found two computing nodes that were not functioning properly. The errors were corrected and the internet connection was established manually: "We will conduct further reliability tests and simulation exercises in the future," says Reiser of his plans. Virtual servers, the cloud storage and services like BayernShare or BayernConfluence are also to be tested for reliability. (vs/ssch)


Prof. Dr. Helmut Reiser, Deputy Director of LRZ