SuperMUC Phase 2 (Haswell nodes) best practice guide
This guide provides information about SuperMUC Phase 2 to enable users to work with the system and to achieve good performance with their applications.
Access to SuperMUC Phase 2 Haswell Nodes
The login nodes of Phase 2 run the SUSE Linux Enterprise Server 11.3 (SLES 11.3) operating system. The installed software stack includes a wide selection of packages and libraries that enable you to compile your programs.
SuperMUC Phase 2 has a separate login node and can only be accessed via the secure shell (SSH). To log in, issue the following UNIX command:
ssh -Y email@example.com
Before compiling your code or running your executable, first display and check the currently loaded module files using the command:
module list
It is sometimes useful to record the MPI libraries, compiler versions, and environment variables under which your code works correctly. It is a good idea to save the output of module list to a dedicated text file. This way you can check whether the loaded modules have changed, which may be the cause of a problem, e.g. if your program no longer runs or compiles after an update of the module system, system maintenance, or an upgrade of our operating systems. Most of the time the output of module list will have changed; if this happens, you may need to unload the default modules and load exactly the same modules that you saved before.
There are many other useful module commands, please consult the environment module system at LRZ.
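As a minimal sketch (assuming the standard environment-modules commands; the file names are placeholders of your choice), saving and later comparing the module state could look like this:

```shell
# Save the currently loaded modules to a reference file (some module
# implementations print to stderr, hence the 2> redirection).
module list 2> modules_reference.txt

# ... later, e.g. after a maintenance or system upgrade ...
module list 2> modules_now.txt

# Show any differences between the saved and the current module set.
diff modules_reference.txt modules_now.txt
```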
Using MPI: All parallel compilers are accessed through the LRZ wrappers mpicc, mpiCC, and mpif90. mpicc invokes the C compiler and links against the currently loaded MPI libraries; likewise, mpiCC wraps the C++ compiler and mpif90 the Fortran compiler.
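For example, compiling an MPI program in C with the wrapper (the file name program.c is just a placeholder) could look like:

```shell
# mpicc wraps the currently loaded C compiler and adds the MPI
# include paths and libraries automatically.
mpicc -O3 program.c -o program
```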
For example, to achieve the maximum performance from the Haswell processors for programs written in C, C++ or Fortran, use the following Haswell-specific optimization flags:
icc -O3 -xCORE-AVX2 program.c
icpc -c program.cpp -O3 -xCORE-AVX2
ifort program.f90 -O3 -xCORE-AVX2
More compiler options and information can be found at Intel_compiler_options.
GCC supports AVX2 officially since GCC 4.8 (and experimentally in 4.7) via the flag -march=core-avx2; it requires binutils 2.22 for assembler support and GDB 7.4 for debugger support. Example:
module unload fortran/intel ccomp/intel
module load gcc/4.8
gcc -c program.c -march=core-avx2
Running an executable built with the Haswell flags above on SuperMUC Phase 1 or SuperMIC will result in the following error:
"Fatal Error: This program was not built to run in your system. Please verify that both the operating system and the processor support Intel(R) AVX2 "
Shared memory (OpenMP) and hybrid (MPI + OpenMP)
Shared memory (OpenMP) allows for implicit intra-node communication and provides an efficient utilization of shared-memory SMP systems. Each compute node on Phase 2 contains two Haswell processors with 14 cores per socket (28 cores per node), or up to 56 logical cores per node with Hyper-Threading.
To compile an OpenMP code use the following flags:
-openmp with the Intel compilers
-fopenmp with the GNU compilers
Hybrid (MPI + OpenMP) combines shared-memory programming (OpenMP) within the SMP nodes with MPI communication between the nodes.
Redesigning a code as hybrid, whether it originally used OpenMP or MPI communication, is the ideal situation for SuperMUC Phase 2; however, this is not always possible or desirable.
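As an illustrative sketch (the source file name hybrid.c and the thread count are placeholders; -openmp applies to the Intel compilers loaded by default), a hybrid code could be built and its thread count set as follows:

```shell
# Compile a hybrid MPI + OpenMP code with the Intel wrapper and
# the Haswell optimization flags.
mpicc -O3 -xCORE-AVX2 -openmp hybrid.c -o hybrid

# Example placement: one MPI task per node, 28 OpenMP threads per task,
# matching the 28 physical cores of a Phase 2 node.
export OMP_NUM_THREADS=28
```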
- Intro-to OpenMP
- AVX2 Support in the Intel® C++ Compiler
- A Guide to Vectorization with Intel® C++ Compilers
- It is not possible to submit jobs to SuperMUC Phase 2 from the login nodes of Phase 1.
- Only complete nodes are provided to a given job for dedicated use.
- Accounting is computed as allocated_nodes * walltime * cores_per_node; for example, a 2-hour job on 10 Phase 2 nodes is charged 10 * 2 * 28 = 560 core-hours.
- Core-hours of Phase 1 and Phase 2 are accounted equally (1 core-hour of Phase 1 = 1 core-hour of Phase 2).
- Running large jobs (>512 nodes) requires permission for the job class "special". Users must submit their requests for "special" through the LRZ service desk.
Running batch jobs on Phase 1 and Phase 2 is very similar; however, please note the different number of cores per node.
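As a hedged sketch of a Phase 2 batch job (SuperMUC uses LoadLeveler; the class name, node count, wall-clock limit, task count, and executable name below are placeholders that must be adapted to your project and the current job-class documentation):

```shell
#!/bin/bash
#@ job_type = parallel
#@ class = general
#@ node = 4
#@ tasks_per_node = 28
#@ wall_clock_limit = 01:00:00
#@ output = job$(jobid).out
#@ error  = job$(jobid).err
#@ queue

# Load the same modules that were used to build the executable.
module list

# Launch the MPI program on all allocated cores (4 nodes x 28 cores).
mpiexec -n 112 ./program
```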