gprof

Profiling lets you find out where your program spends its time and which functions call which other functions during execution. This information shows in which functions most of the time is spent and which are therefore candidates for optimization efforts.

Profiling involves three steps:

  • Compilation and linking of the program with profiling enabled (-p; additionally -g if you want line-level information)
  • Execution of the program to generate a profile data file
  • Execution of a profiling program (like gprof) to analyze the profile data

Compilation

The first step in generating profile information is to compile and link the program with profiling enabled. For the Intel compilers this is done with the options -p (or -pg) and, if line information is desired, -g.

Note: It is important that the -p option is used not only for compilation, but also for linking separately precompiled object files. For the Intel compilers the default optimization level is -O2; however, if -g is specified, the default becomes -O0. To profile optimized code built with debugging symbols, -O2 or -O3 must therefore be specified explicitly after the -pg option.

ifort -pg -O3 -c x1.f
ifort -pg -O3 -c x2.f
ifort -o a.out -pg -O3 x1.o x2.o
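
If line-level information is needed in addition (e.g. for the -l option of gprof discussed below), -g can be added; a minimal sketch, restating the optimization level explicitly as explained in the note above:

ifort -pg -g -O3 -c x1.f
ifort -pg -g -O3 -c x2.f
ifort -o a.out -pg -g -O3 x1.o x2.o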

Execution

Execution of the program does not differ from a normal run: the executable is invoked with the same parameters and file names and should produce the same output. In addition, it writes a file named gmon.out containing the profiling information to the current working directory.
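
For example (a.out, the input file input.dat and the log file run.log are placeholders; gmon.out is only written when the program terminates normally):

./a.out input.dat > run.log    # run as usual
ls -l gmon.out                 # the profile data appears in the working directory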

Profile data analysis

The profile data collected in gmon.out are analysed with the gprof command, which is invoked as follows:

gprof [options] [executable-file [profile-data-files...]] [> outfile]

For a detailed discussion of all options please see the gprof man page. In this documentation we only give short examples using the most common options.
The default output consists of three parts: a flat profile, the call graph, and an alphabetical index of the routines.
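
A few typical invocations (the options are those of GNU gprof; a.out and gmon.out as produced in the previous sections):

gprof ./a.out gmon.out > profile.txt       # full report: flat profile, call graph, index
gprof -b ./a.out gmon.out > profile.txt    # -b suppresses the explanatory text
gprof -p -b ./a.out gmon.out               # flat profile only
gprof -q -b ./a.out gmon.out               # call graph only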

The Flat Profile

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.43      7.96     7.96        1     7.96     9.64  hamilt_
 11.15      9.30     1.34        1     1.34     1.34  outwin_
  8.14     10.28     0.98        1     0.98     1.07  hns_
  2.86     10.62     0.34  5266813     0.00     0.00  zheevx2_
  1.20     10.76     0.14                             __libm_error_support
  1.04     10.89     0.12                             zher2m_
  0.90     11.00     0.11                             cvtas_s_to_a
  0.75     11.09     0.09                             select_
  0.38     11.13     0.05                             cvt_ieee_t_to_text_ex
  0.37     11.18     0.04   445351     0.00     0.00  seclit_
...

The flat profile contains the list of subroutines in the executable, sorted by the time used. It also gives the number of calls to each subroutine, the time spent in the routine itself, and the total time (including the subroutines called by the given routine). The call counts and per-call times are missing for routines that were not compiled with the -p option.
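
If the program was additionally compiled with -g, the flat profile can be broken down to individual source lines with the -l option of GNU gprof; a short sketch:

gprof -l -b ./a.out gmon.out    # line-by-line flat profile (requires -g at compile time)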

The Call Graph

The call graph shows how much time was spent in each function and its children. From this information, you can find functions that, while they themselves may not have used much time, called other functions that did use unusual amounts of time.

Here is a sample call graph for a small example program; it illustrates the structure of the output.

granularity: each sample hit covers 2 byte(s) for 20.00% of 0.05 seconds
index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.00    0.05                 start [1]
                0.00    0.05       1/1       main [2]
                0.00    0.00       1/2       on_exit [28]
                0.00    0.00       1/1       exit [59]
-----------------------------------------------
                0.00    0.05       1/1       start [1]
[2]    100.0    0.00    0.05       1         main [2]
                0.00    0.05       1/1       report [3]
-----------------------------------------------
                0.00    0.05       1/1       main [2]
[3]    100.0    0.00    0.05       1         report [3]
                0.00    0.03       8/8       timelocal [6]
                0.00    0.01       1/1       print [9]
                0.00    0.01       9/9       fgets [12]
                0.00    0.00      12/34      strncmp <cycle 1> [40]
                0.00    0.00       8/8       lookup [20]
                0.00    0.00       1/1       fopen [21]
                0.00    0.00       8/8       chewtime [24]
                0.00    0.00       8/16      skipspace [44]

Profiling MPI code

Some care must be taken when using gprof for MPI applications, because the profiling output of each task must be written to an individually named file, i.e. clobbering of a single shared gmon.out file must be avoided. The most relevant scenarios are described in the following.

IBM MPI:

Programs built with profiling options against IBM PE automatically generate subdirectories profdir.0_<task_id>, into which the gmon.out of the corresponding MPI task is placed. You can then analyze either a single one of these gmon.out files or all of them together.

The variable MP_PROFDIR allows you to override the directory name into which POE/mpiexec stores the gmon.out file. For example, executing an MPI program with 32 tasks after setting MP_PROFDIR to "myprog" will create directories myprog.0_0, ... myprog.0_31 in which the gmon.out files are placed.
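
With the default directory names this could, for example, be analyzed as follows (mpiprog stands for the MPI executable; giving several gmon.out files to gprof sums their profiles into one report):

gprof ./mpiprog profdir.0_0/gmon.out     # a single task
gprof ./mpiprog profdir.0_*/gmon.out     # all tasks summed into one report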

Using a wrapper script:

For other MPI flavours (the following example assumes Intel MPI, which provides the PMI_RANK environment variable), write a shell script "gprof-wrapper" that contains the following commands, set its executable bit, and start your application through this script with the mpiexec command:

#!/bin/bash
# Give each MPI rank its own profile name prefix; glibc appends the process ID,
# so the profile files end up as gmon.out.mpi.<rank>.<pid>
export GMON_OUT_PREFIX=${GMON_OUT_PREFIX:-gmon.out.mpi}.$PMI_RANK
exec "$@"

chmod +x gprof-wrapper
mpiexec -n 128 ./gprof-wrapper ./mpiprog arg1 arg2 arg3
gprof ./mpiprog ./gmon.out.mpi.*

This will place a file gmon.out.mpi.<rank>.<pid> for each task, indexed by MPI rank and process ID, into the working directory. Note that the above is tailored to Intel MPI (which provides the PMI_RANK variable); other MPI flavours will need their own method of determining the MPI rank inside the wrapper script. LRZ provides a more general wrapper script for sequential, Intel MPI and IBM MPI applications:

module load lrztool
export GMON_OUT_PREFIX=gmon.out.mpi
mpiexec -n 128 gprof-wrapper ./mpiprog arg1 arg2 arg3
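
Alternatively, the per-rank profiles can be merged into a single file gmon.sum with the -s option of gprof and analyzed in one go; a sketch based on the file names produced above:

gprof -s ./mpiprog gmon.out.mpi.*        # merges all profiles into gmon.sum
gprof ./mpiprog gmon.sum > profile.txt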
