
Development Environment for the SuperMUC HPC system

This document gives an overview of the most important program development tools available on SuperMUC.




Programming Environment

Using Modules to handle environment settings

LRZ uses the modules approach to manage the user environment for different software, library or compiler versions. The distinct advantage of the modules approach is that the user is no longer required to explicitly specify paths for different executable versions, library versions and locations of other entities needed for the execution environment.


Operating environment

All SuperMUC nodes run a GNU/Linux-based operating environment.

Software | Version
Operating system | SUSE Linux Enterprise Server SLES 11.1 SP1
Kernel | 2.6.32
libc | glibc 2.11.1
gcc release | See the LRZ gcc document for details
X Window System (Xorg-X11) | 7.4
GNU | all available packages
TSM Backup Client | 6.2
Batch queuing | IBM Tivoli Workload Scheduler LoadLeveler

Development Tools and Libraries

Overview

Activity | Tools | Linux versions
Source code development | Editors | vi, vim, emacs, etc.
Executable creation | Compilers | icc, icpc, ifort, gcc, gfortran, g95, pgf90, pgcc
Parallel executable creation | Compilers | mpif90, mpicc, mpiCC (provided by module for the IBM MPI and Intel MPI environments)
Archiving | Library archiver | ar
Object and executable file inspection | Object tools | objdump, ldd
Debugging | Debuggers | gdb, idb, totalview, ddd, DDT, Threading Tools
Performance analysis | Profilers | VTune Amplifier XE (aka VTune), Intel Tracing Tools
Automation | Make | make, gmake
Environment configuration | modules, embedded Tcl | module

See the SuperMUC Software page for a list of the available software packages.

Eclipse (integrated development environment)

Eclipse was designed as an integrated development environment (IDE). The Eclipse CDT installed at LRZ provides an IDE for C/C++ development and retains the generic support for developing Java applications and Eclipse plug-ins, since Eclipse itself is written in Java.

Main features of the Eclipse CDT include:

  • C/C++ Editor (basic functionality, syntax highlighting, code completion etc.)
  • C/C++ Debugger (using GDB)
  • C/C++ Launcher (APIs and default implementation, launches an external application)
  • Search Engine
  • Content Assist Provider
  • Makefile generator
  • Graphical CVS management

Photran, which provides an IDE for Fortran 90, 95 and possibly later standards, has been integrated with CDT. VTune Amplifier XE (formerly known as VTune) is also available with full Eclipse support. The Threading Tools may be included in the future as well. Furthermore, LRZ plans to install and support additional Eclipse toolkits as these become available and stable enough to be recommended for general use.


Compilers and Parallel Programming

A complete list of all compilers and parallel programming libraries available on SuperMUC can be obtained with the following module commands:

module avail -c compilers
module avail -c parallel

These commands list all packages installed in LRZ's module classes compilers and parallel.

Important: Using compilers within batch jobs is strongly discouraged, and no compiler licences are provided on compute nodes. Moreover, massively parallel batch jobs can issue a large number of compiler licence requests, which will eventually bring down the licence server. Therefore, such jobs must be avoided.

Intel Compilers and Performance Libraries

Since SuperMUC is based on Intel's Sandy Bridge/Westmere technology, LRZ recommends the Intel Compilers and Performance Libraries as a first choice. Licensing and support agreements with Intel are intended to ensure that bug fixes become available reasonably quickly.

GNU Compilers

We recommend using the Intel Compilers on SuperMUC. Use the GNU compilers only if strict compatibility with gcc/gfortran is required.

PGI Compilers

Some commercial packages still require the Portland Group (PGI) compiler suite. The PGI Fortran compiler also supports High Performance Fortran. Hence, this package is available on SuperMUC. To use the PGI compilers on the login nodes of SuperMUC, please load the compiler modules ccomp/pgi and fortran/pgi. Support for the PGI compilers is limited, however: the LRZ HPC team will report bugs to PGI, but this is kept at low priority.

Documentation for the PGI compilers is available from the PGI web site.

MPI

Programs parallelized with MPI can run on SuperMUC. Please refer to the section Parallelization for general information about MPI on SuperMUC, and to the sections IBM MPI and POE and Intel MPI for specific information about the MPI flavors on SuperMUC. The MPI introductory document provides additional details on MPI and on further available MPI flavors and their handling.

OpenMP

The multi-threading capabilities of Intel SMPs can be used via the OpenMP implementations of the Intel and PGI Compilers. Examples of OpenMP programming in Fortran, as well as general information about OpenMP, can be found in the section "Parallelization: MPI, OpenMP and POE" and in the LRZ OpenMP introduction.


An Overview of Compiler Functionality

The following sections discuss the most commonly used compiler switches and extensions implemented by the Intel, PGI, and g95 Compilers. We give an overview of the available optimization switches. If you experience any difficulties, you might have to progressively switch off some of them again.

Please also consult the tuning document for further details and tools for optimization.

Optimization Options for x86_64 processors

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-O[0-3] | t.b.d. | t.b.d. | Specifies the code optimization level for applications. | -O0 specifies no optimization, whereas -O3 specifies the highest optimization level (see the compiler documentation for further details).
-fast | t.b.d. | t.b.d. | Maximizes speed across the entire program. | Sets the options -ipo, -O3, -no-prec-div, -static, and -xHost.
-xHost | t.b.d. | t.b.d. | Tells the compiler to generate instructions for the highest instruction set available on the compilation host processor. | Host may be replaced by AVX1/2, SSE4.1/2, SSE2/3, or SSSE3 (see the compiler documentation for further details).
-opt-streaming-stores [always|never|auto] (Intel compiler version 10 and higher; for other versions please use the source code directive !DEC$ VECTOR NONTEMPORAL instead) | -Mnontemporal |  | Some programs may slow down with -fastsse due to the prefetches used. Adding -Mnontemporal offers a different data movement scheme which may improve performance. | Worth a try during code tuning. May be especially useful for memory-bound code, since this supports cache bypass for streaming writes.

Options for Code Transformations, Aliasing and Interprocedural Optimization

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-fno-alias (-fno-fnalias) | n/a | n/a | Assume no aliasing (within functions) | This may give a considerable performance increase. Beware: check your code yourself for pointer aliasing!
-unroll[<number>] | -Munroll[=n:<number>] | -funroll-loops, -funroll-all-loops | Unroll loops | <number> (optional) gives the maximum number of times for unrolling. 0 disables unrolling; omitting it enables compiler heuristics for unrolling. Note that for the Intel compiler you can instead place the source code directive !DEC$ UNROLL(<number>) immediately before the loop (for example before "do i=1,imax") in your code, which might be more useful.
-ip | -Minline[=option[,option,...]] | -finline-functions | Enables interprocedural optimizations for single-file compilation; performs inline function expansion for calls to functions defined within the current source file. | For Intel compilers, you can disable the full/partial inlining enabled by this option by also specifying -ip_no_inlining/-ip_no_pinlining. For the PGI compiler, please check the man page and user's guide for more information on inlining.
-ipo | -Minline and -Mextract with suboptions | n/a | Enables multifile interprocedural (IP) optimizations (between files). Performs inline function expansion for calls to functions defined in separate files. | For the Intel compiler, a set of source files must be specified as an argument. For the PGI compiler, an inline library must be explicitly created.
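
The aliasing warning attached to -fno-alias can be made concrete with a small C sketch. The function names below (ranges_overlap, add_one) are hypothetical and only serve as an illustration; they are not part of any compiler or LRZ library. The sketch shows a loop whose result depends on the iteration order when its two pointer arguments overlap, which is exactly the situation a no-aliasing assumption tells the compiler to ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the byte ranges [a, a+bytes) and
   [b, b+bytes) overlap, 0 otherwise. Addresses are compared as integers. */
int ranges_overlap(const void *a, const void *b, size_t bytes)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    return (pa < pb) ? (pb - pa < bytes) : (pa - pb < bytes);
}

/* Computes dst[i] = src[i] + 1.0 for i in [0, n).
   If dst and src overlap (for example dst == src + 1), each iteration
   reads a value written by the previous one, so the loop order matters.
   A compiler that is told to assume no aliasing may reorder or vectorize
   the loop, and the result can then differ from the serial one. */
void add_one(double *dst, const double *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0;
}
```

A check such as ranges_overlap(dst, src, n * sizeof(double)) in a debug build is one possible way to confirm that a routine is never called with overlapping arguments before a no-aliasing option is switched on for it.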

Linkage Options

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-static | -Bstatic | -Wl,-Bstatic, -nonshared | force static linkage | Recommended if the binary is to be run on a machine where the compiler is not installed. Considerably increases executable size!
-[no-]heap-arrays |  |  | Allocate automatic arrays on heap (Fortran) | The default is to allocate on the stack, which may lead to trouble for low stack limits.
-auto |  |  | Direct all local variables to be automatic (Fortran) |
-c | -c | -c | compile only, do not link | This follows conventional usage.
n/a | -g77libs | n/a | add GNU Fortran libraries | Needed if g77-built objects are to be linked correctly. The Intel Compiler does not support this.
-Ldir | -Ldir | -Ldir | look for libraries in dir as well | This follows conventional usage.
-lmylib | -lmylib | -lmylib | link with library libmylib.{a|so} | This follows conventional usage.

Source format and Preprocessing

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-FI or -fixed [-72|-80|-132] | -Mfixed |  | fixed format source code [with possibly extended width] | Source file extension .f (Intel: also .ftn, .for) automatically assumes fixed form.
-FR or -free | -Mfree |  | free format source code | Source file extension .f90 automatically assumes free form.
-fpp [-P] | -F |  | Invoke preprocessor (C-style includes) | Intel Compiler: the optional -P switch puts preprocessing results in output_file instead of compiling it. Open64 Compiler: -o switch required for preprocessing to output_file. PGI Compiler: source file must have extension .F; output is put into a matching file with extension .f.
-Dname[=value] | -Dname[=value] | -Dname[=value] | define preprocessor macro | This follows conventional usage.
-Idir | -Idir | -Idir | look for include files in dir as well | This follows conventional usage.

Options for Data and I/O

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-i{2|4|8} | -i{2|4|8} | -i{2|4|8} | INTEGER and LOGICAL types of unspecified KIND use the indicated number of bytes | Default value is 4; -i2 is not available for Open64.
-r{4|8|16} | -Mr8 | -r{4|8} | REAL types of unspecified KIND use the indicated number of bytes | Default value is 4. A value of 8 would change all REAL variables to DOUBLE PRECISION. For the PGI Compilers only promotion from 4 to 8 byte REAL is available.
Controlled via an environment run-time option; see the section on Big Endian I/O in the Troubleshooting document | -Mbyteswapio, -byteswapio | (probably not available) | Do unformatted I/O in big endian instead of little endian | PGI Compiler: should enable you to read and write data compatible with Sun and SGI platforms.
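
To illustrate what big-endian versus little-endian I/O means at the data level, the following C sketch reverses the byte order of 32-bit words. It is only an illustration under simplifying assumptions: the function names are hypothetical, and the compilers' run-time libraries perform the conversion internally (including the record markers of Fortran unformatted files), not through code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word
   (little endian <-> big endian). */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Convert an array of n 32-bit words in place, for example after
   reading raw data that was written on a big-endian platform. */
void swap32_array(uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        data[i] = swap32(data[i]);
}
```

Applying swap32 twice returns the original value, so the same routine serves for both reading and writing foreign-endian data.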

Diagnostics, Runtime Checking and Debugging

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-g | -g | -g | Include symbols for debugging | Use idb or Totalview to debug, or pgdbg for PGI-compiled binaries.
-check all | -C | (g77 had -ffortran-bounds-check) | run-time checking | Full checking may incur a large performance penalty.
-opt-report, -opt-report-level[min|max] |  | n/a | generate optimization report | The Intel compiler writes the report to stderr.
-list | -Mlist | n/a | provide source listing | The Intel compiler writes the source listing to stdout, while the PGI compiler produces a file myprog.lst from myprog.f.

The Intel option -check applies to Fortran compilers only. The argument "all" switches on all available checks. It can be replaced by:

  • arg_temp_created: checks for copy-in/copy-out for procedure arguments
  • bounds: performs run-time checks on array subscripts and substring references
  • format, output_conversion: performs run-time checks on formatted I/O
  • pointers: performs run-time checks on pointers and allocatables
  • uninit: performs run-time checks on uninitialized variables (except module globals)
Parallelization and Vectorization Options

Option (Intel) | Option (PGI) | Option (gcc, gfortran) | Meaning | Comments
-openmp | -mp |  | generate multithreaded code from OpenMP directives in the source code | If used, this option must also be specified for linkage.
-openmp-stubs | n/a |  | Compile OpenMP programs for serial mode; directives are ignored and a stub library for the function calls is linked. | If used, this option must also be specified for linkage.
-openmp-report[0|1|2] | n/a |  | Diagnostic level for OpenMP parallelization |
-parallel | -Mconcur[=option[,option]] |  | perform (shared-memory) auto-parallelization | If used, this option must also be specified for linkage. Please refer to the PGI User's Guide, Section 3.1.2 for information on the -Mconcur suboptions.
-par-report[0|1|2] | n/a |  | Diagnostic level for automatic parallelization |
-par-threshold{n} | n/a |  | set threshold for autoparallelization of loops | -par-threshold0: always parallelize; -par-threshold25: parallelize if chance of perf. increase is 25%; -par-threshold75: parallelize if chance of perf. increase is 75% (default); -par-threshold100: only parallelize if absolutely sure. For the PGI compiler, the -Mconcur suboptions (q.v.) allow for a finer control of autoparallelization.
-vec | t.b.d. |  | Enables or disables vectorization. |
-simd | t.b.d. |  | Enables or disables the SIMD vectorization feature of the compiler. |
-vec-report[0-5] | t.b.d. |  | Controls the diagnostic information reported by the vectorizer. | 0 specifies that no diagnostic information is reported; for the other levels please consult the compiler documentation.
-vec-threshold[n] | t.b.d. |  | Sets a threshold for the vectorization of loops. | -vec-threshold0: always vectorize; -vec-threshold75: vectorize if chance of perf. increase is 75%; -vec-threshold100: only vectorize if absolutely sure (default).
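
As a minimal sketch of the kind of code the OpenMP options above act on, the following C function computes a dot product with an OpenMP reduction. The function name dot is hypothetical. According to the table, the code would be compiled and linked with -openmp (Intel) or -mp (PGI); for gcc the corresponding option is -fopenmp, which is not listed in the table above. If the option is omitted, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of x and y (n elements each).
   The reduction clause gives every thread a private partial sum and
   combines the partial sums when the parallel loop ends, so there is
   no concurrent update of the shared variable "sum". */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because floating-point addition is not associative, the parallel result can differ from the serial one in the last digits for general input; this is expected behaviour of a reduction and not a compiler error.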

Compiler Directives for the Intel compiler

The following table shows the source code directives as supported by the Intel Fortran compiler to help with tuning or debugging applications. Note that for fixed source format the "!" comment symbol in the first column needs to be replaced with a "c" comment symbol.

Directive | Meaning
!DEC$ ivdep | Ignore vector dependencies
!DEC$ loop count N | Software pipelining hint
!DEC$ distribute point | Split large loop
!DEC$ unroll [(N)] | Unroll inner loop N times. Compiler heuristics are used if N is omitted.
!DEC$ nounroll | Do not unroll loop
!DEC$ prefetch A | Prefetch array A
!DEC$ noprefetch A | Do not prefetch array A
!DEC$ vector [CLAUSE] | Vectorize loop; CLAUSE = { ALWAYS [ASSERT]|ALIGNED|UNALIGNED|TEMPORAL|NONTEMPORAL [(var1 [, var2]...)] }. For further details please see the compiler documentation.
!DEC$ novector | Do not vectorize loop.
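
The directives in the table are Fortran directives. As a hedged illustration of the same idea in C, the sketch below uses #pragma ivdep, the spelling accepted by the Intel C/C++ compiler; this C form and the function name shift_add are not taken from the table above. Compilers that do not know the pragma ignore it, and the loop then runs unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Computes a[i] = a[i + k] + b[i] for i in [0, n).
   Each iteration only reads elements of a at or ahead of the one it
   writes, so the loop can be vectorized safely as long as b does not
   overlap a. The compiler cannot prove this on its own; the pragma
   asks it to ignore the assumed dependence.
   The caller must provide at least n + k elements in a. */
void shift_add(double *a, const double *b, size_t n, size_t k)
{
    assert(a != NULL && b != NULL);
#pragma ivdep
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i + k] + b[i];
}
```

The directive is a promise by the programmer, not a check: if the arrays do overlap in a way that creates a real dependence, the vectorized loop may produce wrong results.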


Debuggers

Debuggers with graphical Interface (GUI)

  • DDT: Distributed Debugging Tool, a commercial product by Allinea Software.
  • Totalview: a commercial product by Etnus.

The GUI-driven debuggers offer a graphical user interface; simple debugging sessions can therefore be handled without intensive prior study of man pages and manuals.

  • DDT and Totalview are advanced tools for more complex debugging, especially when it comes to debugging parallel codes (MPI, OpenMP). They allow you to inspect data structures in the different threads of a parallel program, set global breakpoints, set breakpoints in individual threads, etc.
  • DDT is the preferred debugger on SuperMUC, and it has the largest number of licences available.
  • Totalview can also be used in CLI mode, whereas DDT is a pure GUI tool.

Table of available Debuggers and Info

Please note that the environment variables given in the column Documentation (e.g. $TOTALVIEW_DOC) refer to environment variables set by the module command on the LRZ HPC systems.

Name | Interface | Supported Compilers | Programming Model | LRZ module | Documentation
Recommended Debuggers
DDT | GUI | g77, g95, icc, ifort | serial, parallel (MPI, OpenMP) | module load ddt | PDF ($DDT_DOC)
Totalview | GUI, CLI | g77, icc, ifort | serial, parallel (MPI, OpenMP) | module load totalview | PDF ($TOTALVIEW_DOC), HTML

Other debuggers such as gdb, idb or DDD are available, but they are difficult to use on the compute nodes.


Threading Tools

The Threading Tools allow you to perform correctness and performance checking on multi-threaded applications (running in shared memory). The parallelization method may be based on POSIX or Linux Threads, or on OpenMP. For OpenMP applications it is necessary to use the Intel compilers in combination with suitably chosen compiler switches to perform the analysis of applications.

  • Thread Checker is the tool which identifies and locates threading issues. Very often, concurrency problems (race conditions) are overlooked by the user during the parallelization process. This tool reliably identifies all problems of this kind; under the right conditions it is also possible to specify the exact location in the source code where things go wrong.
  • Thread Profiler is the tool which provides performance analysis for threaded applications. For each parallel region in the code, scalability extrapolations can be performed, provided that a sufficient number of program runs with varying numbers of threads is carried out.
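
The race conditions mentioned above typically arise from unsynchronized updates of a shared variable. The following C sketch, with the hypothetical function name count_negative, shows such an update inside an OpenMP loop and one common way to make it safe. It is an illustration of the problem class only and makes no claim about how the Threading Tools report it.

```c
#include <assert.h>
#include <stddef.h>

/* Count the negative entries of x (n elements).
   "count" is shared by all threads. Without the atomic directive,
   two threads could read, increment and write it at the same time
   and lose an update, which is a classic race condition.
   The atomic directive makes each increment indivisible. */
long count_negative(const double *x, long n)
{
    long count = 0;
    assert(n >= 0);
#pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        if (x[i] < 0.0) {
#pragma omp atomic
            count++;
        }
    }
    return count;
}
```

For a simple counter like this one, a reduction clause would usually be the faster choice; the atomic form is shown because it keeps the shared update visible in the code.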

Troubleshooting with the Intel, PGI and GNU Compilers