SuperMUC: SCRATCH problems hopefully resolved
Dear users of SuperMUC,
The root causes of the SCRATCH failures that have occurred since the last maintenance at the end of June 2017 have been identified as follows:
- a problem with the cooling of specific hardware components that causes long-term degradation. (hardware fix implemented by the end of July)
- the firmware updates installed during the maintenance contained throughput tunings, which led to higher chip temperatures and triggered failures of hardware that had still been operable until then. Without these tunings, the failures would have been delayed, but not prevented. (firmware fix supplied in July)
- disks going offline under heavy I/O load. (software improvements supplied in late August and early September)
Update (Sep 22, 2017)
After a long period of complex debugging and intensive testing, LENOVO and IBM finally fixed the problem.
We are happy to announce that we are resuming normal operation. Please accept our apologies for any inconvenience caused by this long disruption of this service. Thank you for your patience and understanding.
Users who have used WORK (instead of SCRATCH) should delete unneeded data.