Valgrind: Debugging and profiling of executables on x86 and x86_64
Introduction
Valgrind is a flexible system for debugging and profiling Linux executables on x86 and x86_64 architectures. It provides the following tools:
- Memcheck/Addrcheck: Detection of memory-management problems.
- Cachegrind: Cache profiler. It provides a detailed simulation of the I1, D1 and L2 caches to pinpoint the sources of cache misses.
- Callgrind: Adds call graph tracing to Cachegrind. It can be used to obtain call counts and the inclusive cost of each call made by your program. Beyond what Cachegrind offers, Callgrind can annotate threads separately, and can annotate every instruction in the disassembler output of your program with the number of instructions executed and cache misses incurred.
- Massif: Memory consumption profiling.
- Helgrind: Identification of data races in multithreaded programs.
Valgrind documentation
| Version | Notes | Document |
|---|---|---|
| Version 3.5 | The Valkyrie GUI is also available. | PDF (1 MByte) |
| Version 3.6 | Aggregate release notes. | PDF (1.1 MByte) |
Please also refer to the Valgrind home page for up-to-date information.
Using the Valgrind installation at LRZ
To initialize Valgrind, please first load the appropriate environment module:
module load valgrind
which will provide suitable environment settings for using Valgrind.
In most cases, Valgrind runs your program considerably slower than usual and may also use more resources, so it is advisable to reduce the problem size when profiling.
Example programs
A number of example programs are available that illustrate the various ways of using Valgrind. Please refer to the Valgrind documentation linked above for detailed descriptions. The archive contains the following files:
- A Makefile for building the examples. It also contains targets illustrating the call parameters you need to specify for each case.
- Examples for Memcheck functionality: Fortran programs with (leak1.f90) and without (noleak1.f90) memory leaks.
- Example for memory profiling (heap1.f90).
- Example for cache profiling (cache1.f90). This includes a dummy routine which prevents the compiler from optimizing away the loop.
Example output for leak checking
These programs, incidentally, illustrate a workaround for memory leakage through pointer components. However, whether memory leaks are detected still depends to some extent on the compiler release. Run the two provided examples by typing make run1 and make run2, respectively. When using the Intel Compiler, the following information will appear at the end of the output:
run1 (with leak):
==PID== LEAK SUMMARY:
==PID== definitely lost: 384 bytes in 4 blocks.
==PID== possibly lost: 0 bytes in 0 blocks.
==PID== still reachable: 0 bytes in 0 blocks.
==PID== suppressed: 0 bytes in 0 blocks.

run2 (both leakage sources fixed):
==PID== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 25 from 2)
==PID== malloc/free: in use at exit: 0 bytes in 0 blocks.
==PID== malloc/free: 13 allocs, 13 frees, 6178 bytes allocated.
==PID== For counts of detected errors, rerun with: -v
==PID== No malloc'd blocks -- no leaks are possible.
So some errors are still reported (presumably caused by the way the Fortran compiler handles memory for its own internal use), but the memory leaks induced by the program itself have indeed been fixed.
Example output for memory profiling
Running make run3 produces a PostScript graph of the memory consumption (illustrated below after conversion to PNG) as well as an HTML file giving annotations for the graph.
[Figure: graph of memory consumption produced by make run3]
Example output for cache miss profiling
The example runs CI iterations of a vector triad a = b*c + d on vectors of length CL. Run it with various settings for the problem size and iteration count, for example:
make run4 CL=100 CI=100000
make ann_run4
make run4 CL=10000 CI=1000
make ann_run4
make run4 CL=1000000 CI=10
make ann_run4
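The three settings keep the total amount of work constant while changing how much data each sweep touches. The following sketch works through that arithmetic. It assumes 8-byte (double precision) reals and 64-byte cache lines, neither of which is stated here, and the function names are purely illustrative.

```python
# Back-of-the-envelope sizing for the vector triad a = b*c + d.
# Assumptions (not stated in the text): 8-byte reals, 64-byte cache lines.

BYTES_PER_REAL = 8      # assumed double precision
CACHE_LINE_BYTES = 64   # assumed cache line size
ARRAYS = 4              # a, b, c and d


def working_set_bytes(cl):
    """Bytes touched by one sweep over vectors of length cl."""
    return ARRAYS * cl * BYTES_PER_REAL


def total_updates(cl, ci):
    """Number of element updates a(i) = b(i)*c(i) + d(i) over all iterations."""
    return cl * ci


def streaming_line_fetches(cl, ci):
    """Cache lines fetched if no data is reused between iterations."""
    return working_set_bytes(cl) * ci // CACHE_LINE_BYTES


if __name__ == "__main__":
    for cl, ci in [(100, 100000), (10000, 1000), (1000000, 10)]:
        print(cl, ci, total_updates(cl, ci), working_set_bytes(cl),
              streaming_line_fetches(cl, ci))
```

Under these assumptions the working sets are 3,200 bytes, 320,000 bytes and 32,000,000 bytes, while every run performs 10,000,000 element updates. The no-reuse estimate of 5,000,000 line fetches is only expected to apply when the working set exceeds the cache, which on typical hardware means the largest case; it is of the same order as the D1 miss count reported for that case.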
A report is generated for each run. For the last case indicated above, it looks somewhat like this:
==PID==
==PID== I refs: 53,050,484
==PID== I1 misses: 2,373
==PID== L2i misses: 1,346
==PID== I1 miss rate: 0.0%
==PID== L2i miss rate: 0.0%
==PID==
==PID== D refs: 43,377,893 (30,283,108 rd + 13,094,785 wr)
==PID== D1 misses: 5,395,497 ( 3,769,679 rd + 1,625,818 wr)
==PID== L2d misses: 5,379,113 ( 3,753,638 rd + 1,625,475 wr)
==PID== D1 miss rate: 12.4% ( 12.4% + 12.4% )
==PID== L2d miss rate: 12.4% ( 12.3% + 12.4% )
==PID==
==PID== L2 refs: 5,397,870 ( 3,772,052 rd + 1,625,818 wr)
==PID== L2 misses: 5,380,459 ( 3,754,984 rd + 1,625,475 wr)
==PID== L2 miss rate: 5.5% ( 4.5% + 12.4% )
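The percentages in such a report follow directly from the counts. The sketch below recomputes them for the figures above. It assumes that each rate is misses divided by the corresponding references, and that the overall L2 rate is taken relative to all instruction and data references. The truncation to one decimal place is inferred from these particular numbers, and the helper name is hypothetical.

```python
# Recompute the Cachegrind miss rates from the raw counts in the report.
# Assumption: rates are truncated (not rounded) to one decimal place,
# which is what the figures in this particular report suggest.

def rate(misses, refs):
    """Miss rate in percent, truncated to one decimal place."""
    return (misses * 1000 // refs) / 10


# Counts copied from the report (CL=1000000, CI=10).
i_refs, i1_misses, l2i_misses = 53050484, 2373, 1346
d_rd, d_wr = 30283108, 13094785
d1_rd_miss, d1_wr_miss = 3769679, 1625818
l2d_rd_miss, l2d_wr_miss = 3753638, 1625475

d_refs = d_rd + d_wr
d1_miss_rate = rate(d1_rd_miss + d1_wr_miss, d_refs)
l2d_miss_rate = rate(l2d_rd_miss + l2d_wr_miss, d_refs)

# L2 is referenced whenever a first-level cache misses.
l2_refs = i1_misses + d1_rd_miss + d1_wr_miss
l2_misses = l2i_misses + l2d_rd_miss + l2d_wr_miss

# The overall L2 miss rate is relative to all memory references,
# not to the L2 references.
l2_miss_rate = rate(l2_misses, i_refs + d_refs)
l2_rd_rate = rate(l2i_misses + l2d_rd_miss, i_refs + d_rd)

if __name__ == "__main__":
    print(d_refs, d1_miss_rate, l2d_miss_rate)
    print(l2_refs, l2_misses, l2_miss_rate, l2_rd_rate)
```

Read this way, the overall L2 miss rate (5.5%) is much lower than the D1 and L2d rates (12.4%) because instruction fetches, which almost never miss, are included in its denominator.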
The make ann_run4 step after each profiling run generates an annotated listing of the source lines that shows where the cache misses actually occur. It does so by running
cg_annotate --auto=yes cachegrind.out.<PID>
For this example the output is trivial, but for large applications it may otherwise be quite difficult to find the performance bottlenecks. If the source files are distributed across several directories, you can use the -I switch to help cg_annotate find them.
Example output for call graph generation
For the same program as used in the run3 target above, line-level profiling with Callgrind can be performed by typing
make run5
make ann_run5
The make ann_run5 step after the profiling run generates an annotated listing of the source lines that shows, at line level, where the time was spent, together with the number of calls to each subroutine. It does so by running
callgrind_annotate --auto=yes <result file>
