SuperMUC: phase 2 operation scheduled Nov 15

Dear users of SuperMUC,

Due to the bring-up of the next-generation system SuperMUC-NG and restrictions on our electrical infrastructure, we need to limit the power intake of SuperMUC.

The following table describes the power-saving measures currently in effect as well as those planned for the near term:

Date | System part | Description
August 23, 2018 | All phase 2 (Haswell) nodes | The frequency is capped at 1.8 GHz. This means that even with an energy tag, processors on the system will clock no higher than this cap. For the duration of this measure, the maximum job execution time on the micro and general queues has been adjusted to 60 hours to compensate for slower execution speed.
August 24, 2018 | All phase 1 (Sandy Bridge) nodes | The frequency is capped at 2.0 GHz. This means that even with an energy tag, processors on the system will clock no higher than this cap. For the duration of this measure, the maximum job execution time on the general and large queues has been adjusted to 70 hours to compensate for slower execution speed.
October 4, 2018 | All phase 1 (Sandy Bridge) nodes | Phase 1 thin nodes will become unavailable for user operation in the late afternoon of October 4. The scheduler will be switched to draining mode in advance. The outage is expected to last for several weeks. We will update this document once we know the date for returning phase 1 to user operation. The fat island remains available for job processing.
October 19, 2018 | File systems | The WORK and SCRATCH file systems are accessible again via the phase 2 login nodes.
October 22, 2018 | Project status | All SuperMUC projects that would normally have expired between 2018-10-17 and 2019-02-10 are extended by two months, without additional CPU hours.
November 15, 2018 | All phase 2 (Haswell) nodes | System operation is expected to restart. The frequency is capped at 1.8 GHz as described above.
November 21, 2018 | Phase 1 (Sandy Bridge) nodes | 10 islands of phase 1 have been returned to user operation. The frequency is capped at 2.0 GHz as described above.

Please note that these measures have consequences for job processing in that

  • individual jobs may show reduced performance and therefore require more wall time to complete. You may need to adjust the frequency of checkpointing, or the job run time.
  • overall job throughput will be reduced
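
As a rough illustration of the first point, the following minimal sketch estimates how much wall time a job might need under a frequency cap and checks the result against a queue limit. It assumes that run time scales inversely with clock frequency, which is a pessimistic estimate that applies mainly to compute-bound codes; memory-bound or I/O-bound jobs will typically slow down less. The function names and the uncapped frequency of 2.3 GHz are hypothetical values chosen for the example and do not come from this notice; only the 1.8 GHz cap and the 60-hour limit are taken from the table above.

```python
def scaled_wall_time(hours_uncapped, freq_uncapped_ghz, freq_capped_ghz):
    """Estimate wall time under a frequency cap.

    Assumption (not from the notice): run time is inversely
    proportional to the clock frequency the job actually runs at.
    """
    if hours_uncapped < 0 or freq_uncapped_ghz <= 0 or freq_capped_ghz <= 0:
        raise ValueError("hours must be non-negative and frequencies positive")
    # A cap above the job's previous frequency has no effect.
    effective_ghz = min(freq_uncapped_ghz, freq_capped_ghz)
    return hours_uncapped * freq_uncapped_ghz / effective_ghz


def fits_queue_limit(hours_needed, limit_hours):
    """Return True if the estimated wall time fits within the queue limit."""
    return hours_needed <= limit_hours


if __name__ == "__main__":
    # Hypothetical job: 40 hours when running at 2.3 GHz (illustrative value).
    estimate = scaled_wall_time(40.0, 2.3, 1.8)
    # 60 hours is the adjusted limit stated above for phase 2 micro/general.
    verdict = "fits" if fits_queue_limit(estimate, 60.0) else "exceeds"
    print(f"Estimated wall time: {estimate:.1f} h ({verdict} the 60 h limit)")
```

If the estimate exceeds the queue limit, the job would need to be split into several runs that restart from checkpoints.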

Authors: R. Bader, H. Huber
Published: 2018-11-21