Optimization and Tuning for Intel Processors
Performance measurement, optimization and tuning for Intel Processors running the Linux operating environment
Table of contents
- Background information and manuals
- Selected Intel Compiler Options
- Characteristics of performance tools
- Timers
- Profilers
- Profilers and Instrumentation Using Hardware Counters
- MPI and OpenMP Profiling
- Threading Tools
- Detecting Memory Leaks
Background information and manuals
Selected Intel Compiler Options
| Option | Description |
|---|---|
| -O0 | Disables all optimizations. Recommended for program development and debugging. |
| -O1 | Enables optimization for speed while keeping code size in check (e.g. no loop unrolling). |
| -O2 | Default optimization. Optimizations for speed, including global code scheduling, software pipelining, predication, and speculation. |
| -O3 | -O2 optimizations plus more aggressive optimizations such as prefetching, scalar replacement, and loop transformations. Intended for technical computing applications (loop-intensive code). |
| -Oi | Inline expansion of intrinsic functions. |
| -fno_alias | Specifies that aliasing should not be assumed in the program. Allows the compiler to generate faster code. |
| -ftz | Flushes denormal results to zero (default with -O3). |
| -ipo | Enables interprocedural (IP) optimizations, e.g. inline function expansion for calls to functions defined in separate files. |
| -p | Compiles and links for function profiling with gprof. |
| -prof_genx | Instruments a program for profiling with codecov. |
| -prof_use | Uses previously collected profiling information during optimization. |
| -g | Produces a symbol table, i.e. line numbers are available for profiling. |
| -openmp | Enables the parallelizer to generate multithreaded code based on OpenMP directives. |
| -parallel | Tells the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. To use this option, you must also specify -O2 or -O3. |
| -opt_report | Generates an optimization report on stderr. |
Characteristics of performance tools
What to detect / What to use
- First-pass detection of program-wide performance issues: pfmon, lipfpm (SGI only)
- Detection of procedures or source-code blocks which have problems: pfmon, profile.pl, i2prof.pl, histx (SGI only)
- Detailed information about particular regions of a code: libezpm (SGI only), PAPI
- Very detailed analysis of an application's behaviour: VTune
- First-pass detection of program-wide MPI issues: MPI_STATS (SGI only)
- Very detailed analysis of an application's behaviour: vampir
The characteristics of the various tools are given in the following table.
| Tool | Insert directives into code | Specific compiler flags needed | Specific linker flags needed | Specific libraries needed | Specific runtime settings needed | Post-processing needed | Runtime overhead | Post-processing overhead or work effort | Suitable for OpenMP programs | Suitable for MPI programs | Availability |
|---|---|---|---|---|---|---|---|---|---|---|---|
| timer | yes | no | no | (yes) | no | no | low | low | (yes) | (yes) | all systems |
| gprof | no | -p | -p | no | no | yes | low | low | yes | no | all systems |
| codecov | no | -prof_genx | -prof_genx | no | no | yes | may be high | low | yes | yes | with Intel compiler |
| histx | no | optional: -g | optional: -g | no | no | yes | low | low | yes | yes | SGI systems |
| libezpm | yes | no | no | yes | no | no | medium | low | yes | yes | SGI systems |
| SGI MPI Statistics | no | no | no | no | yes | no | low | low | (no) | yes | SGI systems |
| vampir | optional | no | no | yes | optional | yes | medium | high | yes | yes | all systems |
| scalasca | optional | no | no | yes | optional | yes | low | medium | yes | yes | SuperMUC |
| IPM | optional | no | yes | yes | no | yes | low | low | no | yes | SuperMUC |
| PAPI | yes | no | no | yes | no | no | medium | low | yes | yes | all systems |
| VTune | no | optional: -g | optional: -g | no | no | yes | low | high | yes | yes | all Intel systems |
Timers
Timing commands
time command options
returns real, user and system time for the specified command
Timing routines
| Routine | Library or flag | Purpose | Approx. accuracy | Suitable for multithreaded programs | Remarks |
|---|---|---|---|---|---|
| Very accurate timers: | | | | | |
| Real (kind=8) function DWALLTIME() | /lrz/sys/lib/liblrz.a | wall clock time | 1 µs | yes | uses gettimeofday() |
| mpi_wtime() | -lmpi | wall clock time | 1 µs | yes | use the mpif90 compiler |
| omp_get_wtime | -openmp | wall clock time | 1 µs | yes | use the -openmp compiler flag |
| Integer (kind=8) function GET_CYCLES() | /lrz/sys/lib/liblrz.a | cycles between calls | 40 cycles | no | most accurate timer |
| Standard timers: | | | | | |
| Real (kind=8) function DCPUTIME() | /lrz/sys/lib/liblrz.a | CPU time | 1 ms | yes | uses getrusage() |
| Real (kind=4) function SECOND | -l scs | CPU time | 1 ms | yes | only on SGI |
| Real (kind=4) CPU_TIME | | CPU time | 1 ms | no ? | |
| Real (kind=4) function ETIME (time) | | CPU time | 1 ms | no ? | |
| Real (kind=4) function TIMEF() | | wall clock time | 1 ms | no ? | |
| call DATE_AND_TIME(......) | | date and time | 1 ms | yes | Fortran90 routine |
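The difference between the wall clock timers and the CPU timers in the table can be seen in a short experiment. The following is a minimal Python sketch, not the LRZ library routines: `wall_time` plays the role of a gettimeofday()-style wall clock timer, `cpu_time` reads the process CPU time through getrusage(), and the helper names `measure` and `busy_sum` are made up for this illustration.

```python
import resource
import time


def wall_time():
    """Wall clock seconds since the epoch (analogous to a gettimeofday()-based timer)."""
    return time.time()


def cpu_time():
    """User plus system CPU seconds consumed by this process, read via getrusage()."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def measure(func, *args):
    """Run func(*args) and return (result, elapsed wall time, elapsed CPU time)."""
    w0, c0 = wall_time(), cpu_time()
    result = func(*args)
    w1, c1 = wall_time(), cpu_time()
    return result, w1 - w0, c1 - c0


def busy_sum(n):
    """CPU-bound dummy workload (hypothetical, for illustration only)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    _, wall, cpu = measure(busy_sum, 200_000)
    print(f"compute: wall={wall:.4f}s cpu={cpu:.4f}s")
    _, wall, cpu = measure(time.sleep, 0.2)
    print(f"sleep:   wall={wall:.4f}s cpu={cpu:.4f}s")
```

For the compute-bound call the two times should be close to each other, while for the sleeping call the wall clock time advances and the CPU time barely changes. This is why CPU timers alone can hide time lost to waiting, I/O or idle threads.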
Profilers
gprof
"gprof" produces an execution profile of programs. The effect of called routines is incorporated in the profile of each caller. The profile data is taken from the call graph profile file (gmon.out default) which is created by programs that are compiled with the -p option. The -p option also links in versions of the library routines that are compiled for profiling. "gprof" reads the given object file (the default is "a.out") and establishes the relation between its symbol table and the call graph profile from gmon.out. If more than one profile file is specified, the "gprof" output shows the sum of the profile information in the given profile files. "gprof" calculates the amount of time spent in each routine. Next, these times are propagated along the edges of the call graph. Cycles are discovered, and calls into a cycle are made to share the time of the cycle.
codecov
The Intel Compiler Code-Coverage and Test-Prioritization Tools are part of the Intel Compiler suite. They are based on the Intel profile-guided optimization (PGO) technology and are only available for the Intel compilers.
The routines have to be compiled with the -prof_genx option.
Codecov generates an HTML page showing which code was executed. This helps developers discover which parts of their code are exercised by specified workloads. It does not provide timing information, so its usefulness for tuning programs is limited.
Profile Guided Optimization
The routines have to be compiled with the -prof_gen[x] option.
The instrumented executable is run one or more times with different typical data sets. The dynamic profiling information is merged by profmerge, and the combined information is used to generate a profile-optimized executable by compiling with -prof_use.
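The three phases of this workflow can be written down as a sequence of command lines. The following sketch only builds and prints the commands and executes nothing. The compiler driver `ifort`, the file names, the `.instr` suffix and the convention of passing a data set as a command-line argument are assumptions made for illustration; only `-prof_gen`, `profmerge` and `-prof_use` come from the description above.

```python
def pgo_commands(sources, exe, workloads, compiler="ifort"):
    """Return the command lines of one PGO cycle as argument lists (nothing is executed)."""
    instrumented = exe + ".instr"  # hypothetical name for the instrumented binary
    # Phase 1: build an instrumented executable.
    commands = [[compiler, "-prof_gen", "-o", instrumented, *sources]]
    # Phase 2: run it once per typical data set, then merge the dynamic profiles.
    commands += [["./" + instrumented, data] for data in workloads]
    commands.append(["profmerge"])
    # Phase 3: recompile using the merged profile information.
    commands.append([compiler, "-prof_use", "-o", exe, *sources])
    return commands


if __name__ == "__main__":
    for cmd in pgo_commands(["solver.f90"], "solver", ["small.dat", "large.dat"]):
        print(" ".join(cmd))
```

The data sets used in the second phase should be representative of production runs, since the optimizer tunes the code for the behaviour it observed.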
Profilers and Instrumentation Using Hardware Counters
Note on Using Hardware Counters on HLRB II
If you would like to carry out performance measurements in batch jobs on HLRB II, you can switch off LRZ's performance monitoring by running the command
/lrz/sys/lrz_perf/bin/lrz_perf_off_hlrb2
in your batch job before starting your own measurements. However, this will block the execution of your code for up to 5 minutes (until any LRZ measurements in progress have terminated). On the interactive node a01 of HLRB II there are no periodic measurements by LRZ, so you do not have to switch them off there.
Available Counters and their Meaning
Unfortunately, the names for accessing the hardware counters are not consistent across the variety of tools.
The corresponding names and their meanings are given in a separate table.
persystreport
New performance properties are available in the database collected by the PerSyst Monitoring system at LRZ. Issue the command
persystreport
to get performance information about your jobs that were submitted after December 2010.
By default, the command generates an HTML page. It can also generate a detailed report.
jobperf
Since LRZ monitors all jobs, performance counters can be retrieved from the database. Just issue the command
jobperf
to get information about your jobs submitted before December 2010. In a PBS job it is useful to have the following as the last line of the job script:
jobperf -j $PBS_JOBID
The command displays the average per core.
+--------+------------------+------------------+
| JOB-ID | MFLOPS | MIPS |
+--------+------------------+------------------+
| 1866 | 0.00642653 | 2245.31379330 |
| 1870 | 38.03997692 | 2366.53533088 |
| 2090 | 317.56942103 | 2675.16215452 |
| 2301 | 1145.92991600 | 2947.50605995 |
| 3467 | 2193.10257653 | 2648.66290273 |
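Output of this form is easy to post-process. The sketch below parses the rows shown above and derives the ratio of MFLOPS to MIPS, i.e. the average number of floating-point operations per instruction, as a rough indicator of how floating-point intensive a job is. It assumes that the output looks exactly like the sample; the actual format of jobperf may differ, and the function names are made up for this example.

```python
SAMPLE = """\
| JOB-ID | MFLOPS | MIPS |
+--------+------------------+------------------+
| 1866 | 0.00642653 | 2245.31379330 |
| 1870 | 38.03997692 | 2366.53533088 |
| 2090 | 317.56942103 | 2675.16215452 |
| 2301 | 1145.92991600 | 2947.50605995 |
| 3467 | 2193.10257653 | 2648.66290273 |
"""


def parse_jobperf(text):
    """Return a list of (job_id, mflops, mips) tuples from a jobperf-style table."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue  # border line or unrelated text
        try:
            rows.append((int(cells[0]), float(cells[1]), float(cells[2])))
        except ValueError:
            continue  # header line
    return rows


def flops_per_instruction(mflops, mips):
    """Average floating-point operations per executed instruction."""
    return mflops / mips


if __name__ == "__main__":
    for job_id, mflops, mips in parse_jobperf(SAMPLE):
        ratio = flops_per_instruction(mflops, mips)
        print(f"job {job_id}: {ratio:.4f} flops per instruction")
```

A job with a high instruction rate but a very low ratio, such as the first row of the sample, executes almost no floating-point work and is a candidate for closer inspection.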
qprofile
qprofile is an LRZ wrapper script. OpenMP programs cannot be analyzed with qprofile.
qprofile can be used to profile a given application with respect to time or one performance counter.
Load the qprofile-family of commands by: module load qprofile
pfmon
pfmon presents an easy-to-use interface to the performance monitoring counters on Intel Itanium2 processors.
- collects and analyzes performance data on Linux systems
- event-based sampling
- applications (both user and kernel level)
- no special build required
- low overhead
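The idea behind sampling-based profiling can be sketched in a few lines: at each sampling event the current instruction pointer is recorded, and afterwards the samples are attributed to the functions whose address ranges contain them. The code below is a generic illustration of that attribution step, not pfmon's implementation. The addresses, the symbol table and the function names are invented.

```python
import bisect
from collections import Counter


def build_histogram(samples, symbols):
    """
    samples: iterable of sampled instruction-pointer addresses
    symbols: list of (start_address, name); each function is assumed to
             extend up to the start address of the next one
    Returns a Counter mapping function name to number of samples.
    """
    symbols = sorted(symbols)
    starts = [start for start, _ in symbols]
    counts = Counter()
    for ip in samples:
        # Index of the last symbol whose start address is <= ip.
        idx = bisect.bisect_right(starts, ip) - 1
        name = symbols[idx][1] if idx >= 0 else "<unknown>"
        counts[name] += 1
    return counts


if __name__ == "__main__":
    symbols = [(0x1000, "main"), (0x1100, "solve"), (0x1400, "kernel")]
    samples = [0x1004, 0x1120, 0x1404, 0x1408, 0x1500, 0x0FF0]
    for name, hits in build_histogram(samples, symbols).most_common():
        print(f"{name:10s} {hits}")
```

Because only a fraction of all instructions is sampled, the resulting histogram is a statistical estimate. Routines with very few samples should be interpreted with care.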
profile.pl
profile.pl is a Perl script that serves as an interface to pfmon. It provides a simple way to do procedure-level profiling of an unmodified binary and uses dplace to control the binding of processes to processors.
i2prof.pl
The i2prof.pl script, from the Center for Parallel Computers, KTH, Sweden, is another user-friendly interface to pfmon. This Perl script analyses the user request and runs pfmon (multiple times if necessary) to collect the data needed to calculate metrics from several fine-grained PMU events.
histx
histx is a set of tools and libraries that can assist with performance analysis and bottleneck identification in IA-64 applications running on Linux. histx comes with the following programs:
- lipfpm - reports per-thread Itanium PMU event counts
- samppm - collects per-thread PMU event counts at a regular interval
- dumppm - dumps binary files produced by samppm in human/script readable form.
- histx - profiling tool that samples instruction pointer (ip) or callstack on timer or PMU related events
- iprep - sorts, merges, and "pretty print"s one or more raw ip sampling reports produced by histx
- csrep - produces "butterfly report" from one or more raw callstack sampling reports produced by histx
- tpt - (Trace Posix Threads) collects data on the use of certain Posix threads functions that can strongly impact application performance.
histx comes with the following libraries:
- libezpm.so - an "easy to use" library for accessing PMU event counts from within an application
- libhistx.so - provides calls that allow applications to enable and disable monitoring by lipfpm, samppm, and histx.
PAPI
The Performance Application Programming Interface (PAPI) aims to provide the tool designer and application engineer with a consistent interface and methodology for using the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.
VTune: Intel Visual Tuning Environment
VTune is used to identify and locate performance bottlenecks in code. The VTune analyzer collects, analyzes, and displays software performance data from the system-wide view down to a specific function, module, or instruction. VTune is available in native Linux form as a command-line version and also as a product with a Linux data collector, which requires a Windows console to display graphical results.
Because VTune affects the system in many ways and has pitfalls, its use requires specific permissions. Contact LRZ for access and help.
MPI and OpenMP Profiling
GuideView
GuideView is a tool that displays the performance details of an OpenMP program's parallel execution.
SGI MPI Statistics
The environment variable MPI_STATS enables printing of MPI internal statistics. During MPI_Finalize, each MPI process prints statistics about the amount of data sent with MPI calls. The data is written to stderr. To prefix the statistics messages with the MPI rank, use the -p option on the mpirun command.
NOTE: Because the statistics-collection code is not thread-safe, this variable should not be set if the program uses threads.
Intel Tracing Tools/Vampir
The Intel Tracing Tools, comprised of Trace Collector and Trace Analyzer, support the development and tuning of programs parallelized using the MPI message passing interface. Using these tools enables you to investigate the communication structure of your parallel program, and hence to isolate incorrect and/or inefficient MPI programming.
- Trace Collector provides an MPI tracing library which produces tracing data collected during a typical program run; these tracing data are written to disk in an efficient storage format for subsequent analysis.
- Trace Analyzer provides a GUI for analysis of the tracing data
Scalasca
Scalasca (SCalable performance Analysis of LArge SCale parallel Applications) is an open-source project developed at the Jülich Supercomputing Centre (JSC). It focuses on analyzing OpenMP, MPI and hybrid OpenMP/MPI parallel applications and offers an advanced, user-friendly graphical interface. Scalasca helps identify bottlenecks and optimization opportunities in application codes through a number of features: profiling and tracing of highly parallel programs; automated trace analysis that localizes and quantifies communication and synchronization inefficiencies; flexibility (to focus only on what really matters); user friendliness; and integration with PAPI hardware counters for performance analysis (HLRB2 only).
IPM - Integrated Performance Monitoring
IPM is a portable profiling infrastructure for parallel codes in C and Fortran. It provides a low-overhead profile of the performance and resource utilization of a parallel program. Communication, computation, and I/O are the primary focus.
Threading Tools
The Threading Tools allow you to perform correctness and performance checking on multi-threaded applications (running in shared memory). The parallelization method may be based on POSIX or Linux Threads, or on OpenMP. For OpenMP applications it is necessary to use the Intel compilers in combination with suitably chosen compiler switches to perform the analysis of applications.
- Thread Checker identifies and locates threading issues. Concurrency problems (race conditions) are very often overlooked by the user during parallelization. This tool reliably identifies all problems of this kind; under the right conditions it can also pinpoint the exact location in the source code where things go wrong.
- Thread Profiler provides performance analysis for threaded applications. For each parallel region in the code, scalability extrapolations can be performed, provided that a sufficient number of program runs with varying numbers of threads have been carried out.
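What a scalability extrapolation from several runs can look like is sketched below. This is not the method used by Thread Profiler. It is a simple model based on Amdahl's law, T(p) = T(1) * (s + (1 - s) / p), where the serial fraction s is estimated from each measurement with the Karp-Flatt metric and then averaged. The timings in the example are synthetic.

```python
def serial_fraction(t1, tp, p):
    """Karp-Flatt estimate of the serial fraction from a run on p > 1 threads."""
    speedup = t1 / tp
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


def amdahl_time(t1, s, p):
    """Run time predicted by Amdahl's law for p threads and serial fraction s."""
    return t1 * (s + (1.0 - s) / p)


def extrapolate(timings, target_threads):
    """
    timings: {number of threads: measured seconds}; must contain the
             1-thread run and at least one run with more threads.
    Returns (estimated serial fraction, predicted time on target_threads).
    """
    t1 = timings[1]
    estimates = [serial_fraction(t1, t, p) for p, t in timings.items() if p > 1]
    s = sum(estimates) / len(estimates)
    return s, amdahl_time(t1, s, target_threads)


if __name__ == "__main__":
    measured = {1: 100.0, 2: 55.0, 4: 32.5, 8: 21.25}  # synthetic timings
    s, t16 = extrapolate(measured, 16)
    print(f"estimated serial fraction: {s:.3f}")
    print(f"predicted time on 16 threads: {t16:.2f} s")
```

If the per-run estimates of the serial fraction grow with the number of threads, the code suffers from parallel overhead (synchronization, load imbalance) in addition to serial sections, and a pure Amdahl extrapolation will be too optimistic.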
Detecting Memory Leaks
Valgrind
For finding memory leaks, measuring memory consumption as well as identifying performance bottlenecks, Valgrind is available for the x86-64 systems in the Linux Cluster.
MemoryScape
This tool provides a subset of TotalView functionality for detecting memory leaks. Since Valgrind is not available on IA64-based systems, MemoryScape is the recommended tool on these systems.