Optimization and Tuning for Intel Processors

Performance measurement, optimization and tuning for Intel Processors running the Linux operating environment




Background information and manuals


Selected Intel Compiler Options

Option      | Description
------------+---------------------------------------------------------------------------
-O0         | Disables all optimizations. Recommended for program development and
            | debugging.
-O1         | Enables optimization for speed while limiting code size (e.g. no loop
            | unrolling).
-O2         | Default optimization level. Optimizes for speed, including global code
            | scheduling, software pipelining, predication, and speculation.
-O3         | -O2 optimizations plus more aggressive ones such as prefetching, scalar
            | replacement, and loop transformations. Intended for technical computing
            | applications (loop-intensive code).
-Oi         | Inline expansion of intrinsic functions.
-fno_alias  | Specifies that aliasing should not be assumed in the program. Allows the
            | compiler to generate faster code.
-ftz        | Flushes denormal results to zero (default with -O3).
-ipo        | Enables interprocedural (IP) optimizations, e.g. inline function expansion
            | for calls to functions defined in separate files.
-p          | Compiles and links for function profiling with gprof.
-prof_genx  | Instruments a program for profiling with codecov.
-prof_use   | Uses previously collected profiling information during optimization.
-g          | Produces a symbol table, so that line numbers are available for profiling.
-openmp     | Enables the parallelizer to generate multithreaded code based on OpenMP
            | directives.
-parallel   | Tells the auto-parallelizer to generate multithreaded code for loops that
            | can be safely executed in parallel. Requires -O2 or -O3.
-opt_report | Generates an optimization report on stderr.
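The no-aliasing assumption behind -fno_alias is easiest to see on a small C function. The following is a minimal sketch; the function name add_offset is hypothetical and not taken from any library. Standard C allows the pointer arguments to overlap, so by default the compiler has to reload *offset in every iteration. If it is told that no aliasing occurs, it may keep the value in a register, which is faster but only correct when the arguments really do not overlap.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Stores b[i] + *offset in a[i] for all i < n.
 * If offset points into a, the store to a[i] can change *offset,
 * so the compiler must assume aliasing unless told otherwise. */
void add_offset(double *a, const double *b, const double *offset, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + *offset;
    }
}
```

With a = {1, 1, 1}, b = {1, 2, 3} and offset pointing to a separate variable holding 1, the result is {2, 3, 4}. If offset points to a[0] instead, standard C semantics give {2, 4, 5}, because a[0] changes in the first iteration. Only code of the first kind is safe to compile under a no-aliasing assumption.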


Characteristics of performance tools

What to detect / What to use

  • First-pass detection of program-wide performance issues: pfmon, lipfpm (SGI only)

  • Detection of procedures or source-code blocks which have problems: pfmon, profile.pl, i2prof.pl, histx (SGI only)

  • Detailed information about particular regions of a code: libezpm (SGI only), PAPI

  • Very detailed analysis of an application's behaviour: VTune

  • First-pass detection of program-wide MPI issues: MPI_STATS (SGI only)

  • Very detailed analysis of an application's behaviour: vampir

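The following table lists the characteristics of the various tools. The columns "Code directives" to "Post-processing" state whether directives must be inserted into the code and whether specific compiler flags, linker flags, libraries, runtime settings or post-processing are needed. "Post-proc. effort" is the post-processing overhead or work effort. "OpenMP" and "MPI" state whether the tool is suitable for OpenMP and MPI programs.

Tool               | Code directives     | Compiler flags | Linker flags | Libraries | Runtime settings | Post-processing | Runtime overhead | Post-proc. effort | OpenMP | MPI   | Availability
-------------------+---------------------+----------------+--------------+-----------+------------------+-----------------+------------------+-------------------+--------+-------+--------------------
timer              | yes                 | no             | no           | (yes)     | no               | no              | low              | low               | (yes)  | (yes) | all systems
gprof              | no                  | -p             | -p           | no        | no, only for MPI | yes             | low              | low               | yes    | no    | all systems
codecov            | no                  | -prof_genx     | -prof_genx   | no        | no               | yes             | may be high      | low               | yes    | yes   | with Intel compiler
histx              | no                  | optional: -g   | optional: -g | no        | no               | yes             | low              | low               | yes    | yes   | SGI systems
libezpm            | yes                 | no             | no           | yes       | no               | no              | medium           | low               | yes    | yes   | SGI systems
SGI MPI Statistics | no                  | no             | no           | no        | yes              | no              | low              | low               | (no)   | yes   | SGI systems
vampir             | optional            | no             | no           | yes       | optional         | yes             | medium           | high              | yes    | yes   | all systems
scalasca           | optional, automatic | no             | no           | yes       | optional         | yes             | low              | medium            | yes    | yes   | SuperMUC
IPM                | optional            | no             | yes          | yes       | no               | yes             | low              | low               | no     | yes   | SuperMUC
PAPI               | yes                 | no             | no           | yes       | no               | no              | medium           | low               | yes    | yes   | all systems
VTune              | no                  | optional: -g   | optional: -g | no        | no               | yes             | low              | high              | yes    | yes   | all Intel systems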


Timers

Timing commands

     time command options
returns real, user and system time for the specified command

Timing routines

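Routine                                | Library or            | Purpose                  | Approx.   | Suitable for           | Remarks
                                       | link options          |                          | accuracy  | multithreaded programs |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Very accurate timers:                  |                       |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Real (kind=8) function DWALLTIME()     | /lrz/sys/lib/liblrz.a | wall clock time          | 1 µs      | yes                    | uses gettimeofday()
double dwalltime()                     |                       |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
mpi_wtime()                            | -lmpi                 | wall clock time          | 1 µs      | yes                    | use the mpif90 compiler
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
omp_get_wtime                          | -openmp               | wall clock time          | 1 µs      | yes                    | use the -openmp compiler flag
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Integer (kind=8) function GET_CYCLES() | /lrz/sys/lib/liblrz.a | cycles between calls     | 40 cycles | no                     | most accurate timer
long int get_cycles()                  |                       |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Standard timers:                       |                       |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Real (kind=8) function DCPUTIME()      | /lrz/sys/lib/liblrz.a | CPU time                 | 1 ms      | yes                    | uses getrusage()
double dwalltime()                     |                       |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Real (kind=4) function SECOND,         | -lscs,                | CPU time                 | 1 ms      | yes                    | only on SGI
Real (kind=8) function DSECND          | -lscs_mp              |                          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Real (kind=4) CPU_TIME                 |                       | CPU time                 | 1 ms      | no ?                   |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
real(kind=4) function ETIME (time)     |                       | CPU time: user time in   | 1 ms      | no ?                   |
real(kind=4) time(2)                   |                       | time(1), system time in  |           |                        |
                                       |                       | time(2); return value is |           |                        |
                                       |                       | the sum of both          |           |                        |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
Real(kind=4) function TIMEF()          |                       | wall clock time          | 1 ms      | no ?                   |
---------------------------------------+-----------------------+--------------------------+-----------+------------------------+------------------------------
call DATE_AND_TIME(......)             |                       | date and time            | 1 ms      | yes                    | Fortran90 routine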
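According to the table, DWALLTIME is built on gettimeofday() and DCPUTIME on getrusage(). The sketch below shows how timers of this kind can be written in C with these two POSIX calls. The function names wall_seconds and cpu_seconds are placeholders, reporting user plus system time is an assumption, and the code is not the liblrz source.

```c
#include <stddef.h>        /* NULL */
#include <sys/time.h>      /* gettimeofday */
#include <sys/resource.h>  /* getrusage */

/* Wall clock time in seconds with microsecond resolution. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* CPU time consumed by the calling process, in seconds.
 * Assumption: user and system time are added together. */
double cpu_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec
         + (double)ru.ru_stime.tv_sec + 1.0e-6 * (double)ru.ru_stime.tv_usec;
}
```

A code region is timed by calling the timer before and after the region and subtracting the two values, for example t0 = wall_seconds(); followed later by elapsed = wall_seconds() - t0;.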


Profilers

gprof

"gprof" produces an execution profile of programs. The effect of called routines is incorporated in the profile of each caller. The profile data is taken from the call graph profile file (gmon.out by default), which is created by programs that are compiled with the -p option. The -p option also links in versions of the library routines that are compiled for profiling. "gprof" reads the given object file (the default is "a.out") and establishes the relation between its symbol table and the call graph profile from gmon.out. If more than one profile file is specified, the "gprof" output shows the sum of the profile information in the given profile files. "gprof" calculates the amount of time spent in each routine. Next, these times are propagated along the edges of the call graph. Cycles are discovered, and calls into a cycle are made to share the time of the cycle.
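The propagation of times along the call graph can be illustrated with a small C sketch. This is a simplified illustration and not gprof's implementation: it assumes a call graph without cycles, routines numbered so that callers come before callees, and a callee's time being attributed to its callers in proportion to the number of calls. The names propagate_times and NFUNC are hypothetical.

```c
#define NFUNC 3   /* number of routines in this toy call graph */

/* self[i]     : time spent in routine i itself
 * calls[i][j] : number of times routine i calls routine j
 * total[i]    : on return, self time plus the share of callee time
 * Assumes callers have lower indices than their callees (no cycles). */
void propagate_times(const double self[NFUNC],
                     long calls[NFUNC][NFUNC],
                     double total[NFUNC])
{
    /* Visit callees before callers so their totals are complete. */
    for (int j = NFUNC - 1; j >= 0; j--) {
        total[j] = self[j];
        for (int k = j + 1; k < NFUNC; k++) {
            if (calls[j][k] == 0) continue;
            long all_calls = 0;   /* calls to k from all routines */
            for (int i = 0; i < NFUNC; i++) all_calls += calls[i][k];
            total[j] += total[k] * (double)calls[j][k] / (double)all_calls;
        }
    }
}
```

As an example, let routine 0 call routine 1 once and routine 2 once, and let routine 1 call routine 2 three times, with self times of 1, 4 and 8 seconds. Routine 1 is then charged 4 + 8 * 3/4 = 10 seconds and routine 0 is charged 1 + 10 + 8 * 1/4 = 13 seconds, which equals the sum of all self times.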

codecov

The Intel Compiler Code-Coverage and Test-Prioritization Tools are part of the Intel Compiler suite. They are based on the Intel profile-guided optimization (PGO) technology and are only available for Intel Compilers.

The routines have to be compiled with the -prof_genx option.

codecov generates an HTML page showing which code was executed. This helps developers see which parts of their code are exercised by specific workloads. It does not provide timing information, so its usefulness for tuning programs is limited.

Profile Guided Optimization

The routines have to be compiled with the -prof_gen[x] option.

The instrumented executable is run one or more times with different typical data sets. The dynamic profiling information is merged by profmerge, and the combined information is used to generate a profile-optimized executable by compiling with -prof_use.

Profilers and Instrumentation Using Hardware Counters

Note on Using Hardware Counters on HLRB II

To carry out your own performance measurements in batch jobs on HLRB II, you can switch off LRZ's performance monitoring by running the command

              /lrz/sys/lrz_perf/bin/lrz_perf_off_hlrb2
in your batch job before starting your own measurements. This, however, will block the execution of your code for up to 5 minutes (until any LRZ measurements still in progress have terminated). On the interactive node a01 of HLRB II there are no periodic measurements by LRZ, so you do not have to switch them off there.

Available Counters and their Meaning

Unfortunately, the names for accessing the hardware counters are not consistent across the various tools.

The corresponding names and their meanings are given in a separate table.

persystreport

New performance properties are available in the database collected by the PerSyst Monitoring system at LRZ. Issue the command

               persystreport

to get performance information about your jobs that were submitted after December 2010.

By default, the command generates an HTML page. It can also generate a detailed report.

jobperf

Since LRZ monitors all jobs, performance counters can be retrieved from the database. Just issue the command

      jobperf
to get information about your jobs submitted before December 2010. In a PBS job it is useful to put the following as the last line of the job script:
      jobperf -j $PBS_JOBID
The command displays the average per core.

+--------+------------------+------------------+
| JOB-ID | MFLOPS           | MIPS             |
+--------+------------------+------------------+
|   1866 |       0.00642653 |    2245.31379330 |
|   1870 |      38.03997692 |    2366.53533088 |
|   2090 |     317.56942103 |    2675.16215452 |
|   2301 |    1145.92991600 |    2947.50605995 |
|   3467 |    2193.10257653 |    2648.66290273 |
+--------+------------------+------------------+



qprofile

qprofile is an LRZ wrapper script. It can be used to profile a given application with respect to time or one performance counter. OpenMP programs cannot be analyzed with qprofile.

Load the qprofile-family of commands by: module load qprofile

pfmon

pfmon presents an easy-to-use interface to the performance monitoring counters on Intel Itanium2 processors.

  • collects and analyzes performance data on Linux systems
  • event-based sampling
  • applications (both user and kernel level)
  • no special build required
  • low overhead

profile.pl

profile.pl is a Perl script that serves as an interface to pfmon for running profile experiments. It provides a simple way to do procedure-level profiling of an unmodified binary. profile.pl uses dplace to control the binding of processes to processors.

i2prof.pl

 

The i2prof.pl script, from the Center for Parallel Computers, KTH, Sweden, is another user-friendly interface to the pfmon program. The Perl script analyses the user request and runs pfmon (multiple times if necessary) to collect the data needed to calculate metrics from several fine-grained PMU events.

histx

histx is a set of tools and libraries that can assist with performance analysis and bottleneck identification in IA-64 applications running on Linux. histx comes with the following programs:

  • lipfpm - reports per-thread Itanium PMU event counts
  • samppm - collects per-thread PMU event counts at a regular interval
  • dumppm - dumps binary files produced by samppm in human/script readable form. 
  • histx - profiling tool that samples instruction pointer (ip) or callstack on timer or PMU related events 
  • iprep - sorts, merges, and "pretty print"s one or more raw ip sampling reports produced by histx
  • csrep - produces "butterfly report" from one or more raw callstack sampling reports produced by histx
  • tpt  - (Trace Posix Threads) collects data on the use of certain Posix threads functions that can strongly impact application performance.

histx comes with the following libraries:

  • libezpm.so - an "easy to use" library for accessing PMU event counts from within an application
  • libhistx.so - provides calls that allow applications to enable and disable monitoring by lipfpm, samppm, and histx.

PAPI

PAPI (Performance Application Programming Interface) aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.

VTune: Intel Visual Tuning Environment

VTune is used to identify and locate performance bottlenecks in code. The VTune analyzer collects, analyzes, and displays software performance data from the system-wide view down to a specific function, module, or instruction. VTune is available in native Linux form as a command-line version and also as a product with a Linux data collector, requiring a Windows console for display of graphical results.

Because VTune has many system impacts and pitfalls, its use is restricted by specific permissions. Contact LRZ to obtain access and help.


MPI and OpenMP Profiling

GuideView

GuideView is a tool that displays the performance details of an OpenMP program's parallel execution.

SGI MPI Statistics

 

The environment variable

               MPI_STATS

enables printing of MPI internal statistics. During MPI_Finalize, each MPI process prints statistics about the amount of data it sent with MPI calls. The output goes to stderr. To prefix the statistics messages with the MPI rank, use the -p option of the mpirun command.

NOTE: Because the statistics-collection code is not thread-safe, this variable should not be set if the program uses threads.

Intel Tracing Tools/Vampir

The Intel Tracing Tools, comprised of Trace Collector and Trace Analyzer, support the development and tuning of programs parallelized using the MPI message passing interface. Using these tools enables you to investigate the communication structure of your parallel program, and hence to isolate incorrect and/or inefficient MPI programming.

  • Trace Collector provides an MPI tracing library which produces tracing data collected during a typical program run; these tracing data are written to disk in an efficient storage format for subsequent analysis.
  • Trace Analyzer provides a GUI for analysis of the tracing data.

Scalasca

Scalasca (SCalable performance Analysis of LArge SCale parallel Applications) is an open-source project developed at the Jülich Supercomputing Centre (JSC). It focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications and presents the results in an advanced, user-friendly graphical interface. Scalasca helps to identify bottlenecks and optimization opportunities in application codes through a number of important features: profiling and tracing of highly parallel programs; automated trace analysis that localizes and quantifies communication and synchronization inefficiencies; flexibility (to focus only on what really matters); user friendliness; and integration with PAPI hardware counters for performance analysis (HLRB2 only).

IPM - Integrated Performance Monitoring

IPM is a portable profiling infrastructure for parallel codes in C and Fortran. It provides a low-overhead profile of the performance and resource utilization of a parallel program. Communication, computation, and I/O are the primary focus.

Threading Tools

The Threading Tools allow you to perform correctness and performance checking on multi-threaded applications (running in shared memory). The parallelization method may be based on POSIX or Linux Threads, or on OpenMP. For OpenMP applications it is necessary to use the Intel compilers in combination with suitably chosen compiler switches to perform the analysis of applications.

  • Thread Checker is the tool which identifies and locates threading issues. Very often, concurrency problems (race conditions) are overlooked by the user during the parallelization process. This tool reliably identifies all problems of this kind; under the right conditions it is also possible to specify the exact location in the source code where things go wrong.

  • Thread Profiler is the tool which provides performance analysis for threaded applications. For each parallel region in the code, scalability extrapolations can be performed, provided that a sufficient number of program runs with varying numbers of threads is performed.

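A typical race condition of the kind a thread checking tool is meant to find is an unsynchronized update of a shared variable inside an OpenMP loop. The sketch below shows the corrected form, which uses a reduction clause; the racy variant is given in the comment. The function name sum_reduction is hypothetical. When compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.

```c
/* Racy variant (do not use): all threads update the shared variable
 * `sum` without synchronization, so updates can be lost.
 *
 *     #pragma omp parallel for
 *     for (int i = 0; i < n; i++) sum += x[i];
 */

/* Corrected variant: each thread accumulates a private partial sum,
 * and the partial sums are combined at the end of the loop. */
double sum_reduction(const double *x, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

For an array holding the values 1 to 100, the function returns 5050 regardless of the number of threads, whereas the racy variant may return a smaller value that changes from run to run.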


Detecting Memory Leaks

Valgrind

For finding memory leaks, measuring memory consumption as well as identifying performance bottlenecks, Valgrind is available for the x86-64 systems in the Linux Cluster. 
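As a minimal illustration of what a leak check looks for, the following C sketch allocates a temporary work buffer and releases it again. The function name sum_of_squares is hypothetical. If the call to free() were removed, every call would leave one unreachable block behind, which is the kind of defect a leak checker reports; with Valgrind, such a check is typically started with valgrind --leak-check=full followed by the program name.

```c
#include <stdlib.h>

/* Hypothetical example routine.
 * Returns 1^2 + 2^2 + ... + n^2 using a temporary buffer,
 * or -1.0 if the buffer cannot be allocated. */
double sum_of_squares(size_t n)
{
    double *work = malloc(n * sizeof *work);
    if (work == NULL && n > 0) {
        return -1.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        work[i] = (double)(i + 1) * (double)(i + 1);
        sum += work[i];
    }
    free(work);   /* omitting this line leaks n * sizeof(double) bytes per call */
    return sum;
}
```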

MemoryScape

This tool provides a subset of TotalView functionality to detect memory leaks. Since Valgrind is not available on IA64-based systems, MemoryScape is the recommended tool on these systems.