Many-Core Cluster CooLMUC-3 at LRZ

In September 2017, the Leibniz Rechenzentrum in Garching will put a new cluster based on Intel many-core processors into service. The innovations are not limited to the node architecture; they also extend to the interconnect and the cooling infrastructure.

LRZ’s procurement targets for the new cluster system were to:

  • supply its users with a system suited for processing highly vectorizable and thread-parallel applications,
  • provide good scalability across node boundaries for strong scaling, and
  • deploy state-of-the-art high-temperature water-cooling technology so that system operation avoids heat transfer to the ambient air of the computer room.

Furthermore, the system features an extensible design, enabling seamless addition of further compute nodes of various architectures.

Upon conclusion of a European procurement process, the Leibniz Supercomputing Centre (LRZ) has signed a contract with MEGWare (https://www.megware.com) for the delivery and installation of an HPC cluster based on Intel’s many-core architecture. The baseline installation will consist of 148 many-core Intel “Knights Landing” compute nodes (Xeon Phi 7210-F hosts), connected to each other via an Intel Omnipath high-performance network in a fat-tree topology. A standard Intel Xeon login node will be available for development work and job submission. CooLMUC-3 will comprise three water-cooled racks, using an inlet temperature of at least 40 °C, and one rack for the few remaining air-cooled components (e.g. management servers), which use less than 3% of the system’s total power budget. A very high fraction of the waste heat is deposited into water; this is achieved by deploying liquid-cooled power supplies and thermally insulating the racks to suppress radiative losses. The Omnipath switches will also be delivered in water-cooled versions and therefore do not require any fans.

Many-core Architecture, the “Knights Landing” Processor

The processor generation installed in this system is the first many-core generation from Intel that operates stand-alone; previous generations were only available as accelerator cards, which greatly added to the complexity and effort required to provision, program, and use such systems. The LRZ cluster, in contrast, can be installed with a standard operating system and used with the same programming models familiar to users of Xeon-based clusters. The high-speed interconnect between the compute nodes is realized by an on-chip interface and a mainboard-integrated dual-port adapter, which benefits latency-bound parallel applications in comparison to PCI-card interconnects.

A further feature of the architecture is the closely integrated MCDRAM, also known as “High-Bandwidth Memory” (HBM). The bandwidth of this memory is several times higher than that of the node’s conventional DDR4 memory (roughly 460 GB/s versus 80 GB/s, see the overview table below). It can be configured either as cache memory, as directly addressable memory, or as a 50/50 hybrid of the two. The vector units have also been expanded: each core contains two AVX-512 VPUs and can therefore, when multiplications and additions are combined into fused multiply-add operations, perform 32 double-precision floating-point operations per cycle. Each pair of cores is tightly coupled and shares a 1 MB L2 cache to form a “tile”; 32 tiles share a two-dimensional mesh interconnect with a bisection bandwidth of over 700 GB/s, over which cache-coherence and data traffic flow in various user-configurable modes. The extent to which these configuration options can improve specific user applications is currently under evaluation, since changing them generally requires a reboot of the affected nodes.
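How the directly addressable (“flat”) or hybrid configurations are used from application code can be illustrated with a small sketch. The example below is a minimal illustration, assuming the node is booted with the MCDRAM exposed as directly addressable memory and that the memkind library with its hbwmalloc interface is installed; in cache mode no source changes are needed at all, since the MCDRAM then acts transparently as a cache in front of DDR4. The array size, file name and fallback policy are illustrative assumptions, not part of the system description.

    /*
     * Minimal sketch (not LRZ-provided code): placing a hot array in MCDRAM
     * when the node exposes it as directly addressable memory, using the
     * memkind library's hbwmalloc interface.
     * Compile e.g. with: icc -O2 hbm_alloc.c -lmemkind
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>                        /* memkind's HBM allocator */

    int main(void)
    {
        const size_t n = 1UL << 28;               /* 2 GiB of doubles, fits easily in 16 GB MCDRAM */
        int have_hbw = (hbw_check_available() == 0);

        /* Allocate in high-bandwidth memory if present, otherwise fall back to DDR4. */
        double *a = have_hbw ? hbw_malloc(n * sizeof(double))
                             : malloc(n * sizeof(double));
        if (!a) { perror("allocation failed"); return 1; }

        for (size_t i = 0; i < n; ++i)            /* first touch places the pages */
            a[i] = (double)i;

        printf("allocated %zu doubles in %s\n", n, have_hbw ? "MCDRAM" : "DDR4");

        if (have_hbw) hbw_free(a);
        else          free(a);
        return 0;
    }

Alternatively, an unmodified application can be bound entirely to the MCDRAM NUMA node with numactl (e.g. numactl --membind), at the cost of losing the selective placement shown above.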

Because of the low core frequency and the small amount of memory per core, the system is not suited for serial throughput load, even though the instruction set permits execution of legacy binaries. For best performance, existing parallel applications will likely require a significant optimization effort. To make efficient use of the memory and to exploit all levels of parallelism in the architecture, a hybrid approach (e.g. combining MPI and OpenMP) is typically considered best practice; a minimal sketch of this pattern is given below. Data layouts will often need to be restructured to achieve cache locality, a prerequisite for effectively using the wider vector units. For distributed-memory codes that exchange small messages, the integration of the Omnipath network interface into the chip set of the compute node can bring a significant performance advantage over a PCI-attached network card.
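The following sketch shows the basic shape of such a hybrid program: MPI distributes the work across nodes, while OpenMP threads and SIMD lanes handle the node-local part. It is a generic illustration, not an LRZ code; the problem size and the compile and run settings mentioned afterwards are assumptions.

    /*
     * Minimal hybrid MPI+OpenMP sketch: a few MPI ranks per node, OpenMP
     * threads across the cores/hyperthreads, a vectorizable inner loop,
     * and a reduction across ranks. The problem size is illustrative.
     */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const long n = 1L << 24;
        double local_sum = 0.0;

        /* Threads and SIMD lanes work on the node-local part of the problem. */
        #pragma omp parallel for simd reduction(+:local_sum)
        for (long i = 0; i < n; ++i) {
            double x = (rank * n + i) * 1.0e-6;
            local_sum += x * x;                   /* FMA-friendly loop body */
        }

        double global_sum = 0.0;
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%e\n",
                   nranks, omp_get_max_threads(), global_sum);

        MPI_Finalize();
        return 0;
    }

With the Intel toolchain listed in the overview table, such a code would typically be built with something like mpiicc -qopenmp -O3 -xMIC-AVX512 and run with one or a few MPI ranks per node and OMP_NUM_THREADS set to the remaining cores or hyperthreads; the settings that work best are application-dependent.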

Over the past three years, LRZ has acquired know-how in optimizing for many-core systems through a collaboration with Intel. This collaboration included tuning codes for optimal execution on the previous-generation “Knights Corner” accelerator cards used in the SuperMIC prototype system; guidance on how to perform such optimizations will be documented on the LRZ web server and can be supplied on a case-by-case basis by the LRZ application support staff. The Intel development environment (“Intel Parallel Studio XE”), which includes compilers, performance libraries, an MPI implementation and additional tuning, tracing and diagnostic tools, assists programmers in achieving good application performance. Courses on programming many-core systems as well as on using the Intel toolset are regularly scheduled within the LRZ course program[1].
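As one concrete way of leaning on the performance libraries instead of hand tuning, the sketch below calls Intel MKL’s CBLAS interface for a double-precision matrix multiplication; MKL then handles blocking, threading and AVX-512 vectorization internally. The matrix size, file name and link line are illustrative assumptions.

    /*
     * Sketch: using a performance library (Intel MKL, listed in the overview
     * table) rather than a hand-written kernel.
     * Compile e.g. with: icc -O2 dgemm_example.c -mkl
     */
    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        const int n = 1024;
        double *a = mkl_malloc((size_t)n * n * sizeof(double), 64);  /* 64-byte aligned */
        double *b = mkl_malloc((size_t)n * n * sizeof(double), 64);
        double *c = mkl_malloc((size_t)n * n * sizeof(double), 64);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < (long)n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        /* C = 1.0 * A * B + 0.0 * C, all matrices n x n in row-major layout */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);

        printf("c[0] = %g (expected %g)\n", c[0], 2.0 * n);

        mkl_free(a); mkl_free(b); mkl_free(c);
        return 0;
    }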

Omnipath Interconnect

With its acquisitions of network technology over the last decade, Intel has chosen a new strategy for high-performance networks, namely integrating the network into the processor architecture. This cluster marks the first time the Omnipath interconnect, already in a mature first generation, is put into service at LRZ. It is characterized by markedly lower application latencies, higher achievable message rates, and high aggregate bandwidth at a better price than competing hardware technologies. LRZ will use this system to gather experience with the management, stability, and performance of the new technology.
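Latency and message-rate figures of this kind are usually verified with simple micro-benchmarks. The following ping-pong sketch, run with one MPI rank on each of two nodes, estimates the one-way latency of the kind quoted in the overview table below; the message size and repetition count are illustrative choices, not a prescribed benchmark.

    /* Ping-pong micro-benchmark sketch: run with exactly two ranks on two nodes. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        char buf[8] = {0};
        const int reps = 10000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        if (nranks != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half the round-trip time */
            printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * reps) * 1e6);

        MPI_Finalize();
        return 0;
    }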

Cooling Infrastructure

LRZ was a pioneer in the introduction of warm-water cooling in Europe. CooLMUC-1, installed in mid-2011 by MEGWare, was the first such system at LRZ. A little more than a year later, the IBM/Lenovo 3-PFlop/s “SuperMUC” went into service, the first system that allowed an inlet temperature of around 40 °C and therefore year-round chillerless cooling. The high water temperatures also allow this waste energy to be converted into additional cooling capacity for the remaining air-cooled and cold-water-cooled components (e.g. storage): in 2016, after a detailed pilot project in collaboration with the company Sortech, an adsorption cooling system was put into service for the first time, converting roughly half of the Lenovo cluster’s waste heat into cooling capacity.

CooLMUC-3 now takes the next step in improving energy efficiency. By extending water cooling to additional components such as power supplies and network switches, it becomes possible to thermally insulate the racks and to virtually eliminate heat emission to the server room (only about 3% of the electrical energy).

Overview of CooLMUC-3 characteristics

Hardware

  • Number of nodes: 148
  • Cores per node: 64
  • Hyperthreads per core: 4
  • Core nominal frequency: 1.3 GHz
  • Memory (DDR4) per node: 96 GB (bandwidth 80.8 GB/s)
  • High-bandwidth memory per node: 16 GB (bandwidth 460 GB/s)
  • Bandwidth to interconnect per node: 25 GB/s (2 links)
  • Number of Omnipath switches (100SWE48): 10 + 4 (48 ports each)
  • Bisection bandwidth of interconnect: 1.6 TB/s
  • Latency of interconnect: 2.3 µs
  • Peak performance of system: 394 TFlop/s

Infrastructure

  • Electric power of fully loaded system: 62 kVA
  • Percentage of waste heat to warm water: 97%
  • Inlet temperature range for water cooling: 30 … 50 °C
  • Temperature difference between outlet and inlet: 4 … 6 °C

Software (OS and development environment)

  • Operating system: SLES12 SP2 Linux
  • MPI: Intel MPI 2017, alternatively OpenMPI
  • Compilers: Intel icc, icpc, ifort 2017
  • Performance libraries: MKL, TBB, IPP
  • Tools for performance and correctness analysis: Intel Cluster Tools

The performance numbers in the table above are theoretical peak values that cannot be reached by any real-world application. As a measure of the practically achievable bandwidth of the high-bandwidth memory, the STREAM benchmark yields approximately 450 GB/s per node, and the committed LINPACK performance of the complete system is 255 TFlop/s.
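For reference, the peak value follows directly from the node architecture: 64 cores × 2 VPUs × 8 double-precision lanes × 2 (fused multiply-add) × 1.3 GHz ≈ 2.66 TFlop/s per node, and 148 nodes therefore give roughly 394 TFlop/s. Bandwidth figures of the STREAM kind can be approximated with a triad-style loop such as the sketch below; it uses the same access pattern as the STREAM triad but is not the official benchmark, and the array sizes are illustrative.

    /*
     * Triad-style bandwidth sketch (same access pattern as the STREAM triad,
     * not the official benchmark). Compile e.g. with: icc -qopenmp -O3 triad.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 1L << 27;                  /* three 1 GiB arrays, well below 16 GB MCDRAM */
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        double *c = malloc(n * sizeof(double));
        if (!a || !b || !c) return 1;

        #pragma omp parallel for                  /* first-touch initialization */
        for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for simd
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + 3.0 * c[i];             /* triad: two loads and one store per element */
        double t1 = omp_get_wtime();

        double gbytes = 3.0 * n * sizeof(double) / 1e9;
        printf("triad bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

        free(a); free(b); free(c);
        return 0;
    }

To see the high-bandwidth figure, the arrays have to reside in MCDRAM, either by allocating them as in the HBM example further above or by running the node in cache mode, where the MCDRAM is used automatically.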