Linux Cluster: Status after Emergency Maintenance Oct 28-30

Status Update Nov 8, 15:00

All systems are online again.

Dear users of the Linux Cluster Systems at LRZ,

In order to resolve the stability problems that are still affecting a significant fraction of jobs, we have decided to conduct an emergency maintenance phase.

For this reason, all cluster systems will be taken offline on October 28 at 8:00 am.

Overall, the interruption is expected to take three working days, but some parts of the cluster may be returned to operation earlier.

While we have isolated a number of problems, further work is needed to identify precise root causes. Therefore, part of the time above will be needed to perform diagnostics and test runs on the system. We apologize for this further disruption, but the rather diffuse ways in which the problems show up necessitate this way of proceeding.

Author: R. Bader
Published: 2019-11-08