likwid - light weight performance tools

Likwid stands for "Like I knew what I am doing". The project provides easy-to-use command-line tools for Linux that support programmers in developing high-performance multi-threaded programs.

Likwid Tools

Likwid contains the following tools:

  • likwid-topology: Show the thread and cache topology
  • likwid-perfctr: Measure hardware performance counters on Intel and AMD processors
  • likwid-pin: Pin your threaded application without touching your code (supports pthreads, Intel OpenMP and gcc OpenMP)
  • likwid-perfscope: Frontend for likwid-perfctr timeline mode. Allows live plotting of performance metrics.
  • likwid-powermeter: Tool for accessing RAPL counters and querying Turbo mode steps on Intel processors.

Version 3.1

Version 3.1 is available for SuperMUC Phase 1 only (Westmere-EX and Sandy Bridge-EP architectures). The likwid module requires one of the two modules: mpi.ibm or mpi.intel. Load one of them even if you intend to use an application without MPI.

likwid-perfctr needs to know which CPU is to be measured. In Version 3.1, the script likwid-mpi-wrapper maps the MPI ranks to CPUs and pins the processes. Example:

# Define control environment variables here
mpiexec likwid-mpi-wrapper ./myexecutable


The following setups show how to set the control environment variables for each MPI flavor:

IBM MPI, pure MPI:

module load likwid
export LIKWID_OUTPUT=myexe
export LIKWID_GROUP=FLOPS_DP
export LIKWID_VERBOSE=yes
mpiexec likwid-mpi-wrapper ./myexe

Intel MPI, pure MPI:

module unload mpi.ibm
module load mpi.intel
module load likwid
export OMP_NUM_THREADS=<1 or number>
export I_MPI_PIN=YES
export I_MPI_PIN_CELL=core
export LIKWID_OUTPUT=myexe
export LIKWID_GROUP=FLOPS_DP
export LIKWID_VERBOSE=yes
mpiexec likwid-mpi-wrapper ./myexe

IBM MPI, hybrid:

module load likwid
export OMP_NUM_THREADS=<1 or number>
export LIKWID_OUTPUT=myexe
export LIKWID_GROUP=FLOPS_DP
export LIKWID_VERBOSE=yes
mpiexec likwid-mpi-wrapper ./myexe

Intel MPI, hybrid:

module unload mpi.ibm
module load mpi.intel
module load likwid
export OMP_NUM_THREADS=<1 or number>
export I_MPI_PIN=YES
export I_MPI_PIN_CELL=core
export I_MPI_PIN_DOMAIN=omp:compact
export LIKWID_OUTPUT=myexe
export LIKWID_GROUP=FLOPS_DP
export LIKWID_VERBOSE=yes
mpiexec likwid-mpi-wrapper ./myexe

Variables for Control:

LIKWID_OUTPUT: Name of the output file. The output is written to a file named name-mpirank_hostname_pid.txt. If an OpenMP program is executed, the output file also lists the cores on which the threads ran.

LIKWID_GROUP: Performance group to be measured. Run likwid-perfctr -a to list the available groups.

LIKWID_VERBOSE: If set to yes, the command to be executed is displayed.
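
The naming scheme name-mpirank_hostname_pid.txt can be taken apart in a shell script, for example to sort results by MPI rank. The following is only a minimal sketch: the helper parse_likwid_name and the sample file name are hypothetical, and it assumes that the rank and the process ID contain no underscore.

```shell
#!/bin/bash
# Split an output file name of the form
#   name-mpirank_hostname_pid.txt
# into rank, host name and process ID.
# parse_likwid_name is a hypothetical helper, not part of LIKWID.
parse_likwid_name() {
    local prefix="$1"                 # value used for LIKWID_OUTPUT
    local file="$2"                   # output file name (path allowed)
    local base=${file##*/}            # drop any directory part
    base=${base%.txt}                 # drop the .txt suffix
    local rest=${base#"${prefix}-"}   # drop the leading "name-"
    local rank=${rest%%_*}            # text before the first "_"
    local pid=${rest##*_}             # text after the last "_"
    local host=${rest#*_}             # drop "rank_"
    host=${host%_*}                   # drop "_pid"
    echo "rank=${rank} host=${host} pid=${pid}"
}

# Example with a made-up file name:
parse_likwid_name myexe myexe-3_node01_12345.txt
```

Because the rank is taken from the front and the process ID from the back, a host name that itself contains an underscore is still extracted as a whole.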

Example output

CPU type:   Intel Westmere EX processor
CPU clock:  2.39 GHz
Measuring group FLOPS_DP
-------------------------------------------------------------
/lrz/sys/tools/lrztools/1.0/src/placementtest-mpi.intel
+--------------------------------------+-------------+-------------+
|                Event                 |   core 18   |   core 19   |
+--------------------------------------+-------------+-------------+
|          INSTR_RETIRED_ANY           | 1.20773e+08 | 9.63163e+07 |
|        CPU_CLK_UNHALTED_CORE         | 5.43772e+07 | 7.10149e+07 |
|         CPU_CLK_UNHALTED_REF         | 5.44599e+07 | 7.48141e+07 |
|    FP_COMP_OPS_EXE_SSE_FP_PACKED     | 1.21252e+06 | 1.21254e+06 |
|    FP_COMP_OPS_EXE_SSE_FP_SCALAR     |      2      |     776     |
| FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION | 1.16212e+06 | 1.16214e+06 |
| FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION |    50399    |    51174    |
+--------------------------------------+-------------+-------------+
+-------------------------------------------+-------------+-------------+-------------+-------------+
|                   Event                   |     Sum     |     Max     |     Min     |     Avg     |
+-------------------------------------------+-------------+-------------+-------------+-------------+
|          INSTR_RETIRED_ANY STAT           | 2.17089e+08 | 1.20773e+08 | 9.63163e+07 | 1.08545e+08 |
|        CPU_CLK_UNHALTED_CORE STAT         | 1.25392e+08 | 7.10149e+07 | 5.43772e+07 | 6.26961e+07 |
|         CPU_CLK_UNHALTED_REF STAT         | 1.29274e+08 | 7.48141e+07 | 5.44599e+07 | 6.4637e+07  |
|    FP_COMP_OPS_EXE_SSE_FP_PACKED STAT     | 2.42505e+06 | 1.21254e+06 | 1.21252e+06 | 1.21253e+06 |
|    FP_COMP_OPS_EXE_SSE_FP_SCALAR STAT     |     778     |     776     |      2      |     389     |
| FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION STAT | 2.32426e+06 | 1.16214e+06 | 1.16212e+06 | 1.16213e+06 |
| FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION STAT |   101573    |    51174    |    50399    |   50786.5   |
+-------------------------------------------+-------------+-------------+-------------+-------------+
+----------------+-------------+-----------+
|     Metric     |   core 18   |  core 19  |
+----------------+-------------+-----------+
|  Runtime [s]   |  0.0227145  | 0.0296644 |
|  Clock [MHz]   |   2390.31   |  2272.38  |
|      CPI       |  0.450243   |  0.73731  |
|    MFlops/s    |   9.86795   |  9.87128  |
| Packed MUOPS/s |   4.93397   |  4.93406  |
| Scalar MUOPS/s | 8.13841e-06 | 0.0031577 |
|   SP MUOPS/s   |   4.72889   |  4.72898  |
|   DP MUOPS/s   |  0.205084   | 0.208237  |
+----------------+-------------+-----------+
+---------------------+------------+-----------+-------------+------------+
|       Metric        |    Sum     |    Max    |     Min     |    Avg     |
+---------------------+------------+-----------+-------------+------------+
|  Runtime [s] STAT   | 0.0523789  | 0.0296644 |  0.0227145  | 0.0261895  |
|  Clock [MHz] STAT   |  4662.69   |  2390.31  |   2272.38   |  2331.34   |
|      CPI STAT       |  1.18755   |  0.73731  |  0.450243   |  0.593776  |
|    MFlops/s STAT    |  19.7392   |  9.87128  |   9.86795   |  9.86961   |
| Packed MUOPS/s STAT |  9.86803   |  4.93406  |   4.93397   |  4.93401   |
| Scalar MUOPS/s STAT | 0.00316584 | 0.0031577 | 8.13841e-06 | 0.00158292 |
|   SP MUOPS/s STAT   |  9.45787   |  4.72898  |   4.72889   |  4.72894   |
|   DP MUOPS/s STAT   |  0.413321  | 0.208237  |  0.205084   |  0.206661  |
+---------------------+------------+-----------+-------------+------------+
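
The CPI metric in the output above can be cross-checked against the raw event counts. The following sketch assumes that CPI is the ratio of CPU_CLK_UNHALTED_CORE to INSTR_RETIRED_ANY, which is consistent with the numbers shown; the helper cpi is hypothetical and not part of Likwid.

```shell
#!/bin/bash
# Recompute CPI (cycles per instruction) from two raw event counts.
# Assumption: CPI = CPU_CLK_UNHALTED_CORE / INSTR_RETIRED_ANY.
# cpi is a hypothetical helper, not part of LIKWID.
cpi() {
    # $1 = CPU_CLK_UNHALTED_CORE, $2 = INSTR_RETIRED_ANY
    awk -v cycles="$1" -v instr="$2" 'BEGIN { printf "%.6f\n", cycles / instr }'
}

# Event counts for core 18 and core 19 from the example output
cpi 5.43772e+07 1.20773e+08
cpi 7.10149e+07 9.63163e+07
```

The results agree with the CPI row of the metric table up to rounding. The rate metrics (MFlops/s, MUOPS/s) are not recomputed here because the time base they use is not stated in the output.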

Version 4.0

The default version is 4.0, which works on the following systems: SuperMUC Phase 1, SuperMUC Phase 2, and CooLMUC2.
Version 4.0 has no likwid-mpi-wrapper. However, likwid-mpirun works when pinning arguments are given. Note that only mpi.intel is supported. Here is an example that uses pinning (2 Sandy Bridge-EP nodes):

likwid-mpirun -np 32 -nperdomain S:8 -g CLOCK ./myexecutable

This command pins 8 tasks per socket. More documentation on likwid-mpirun is available in the LIKWID documentation.
Known issue: at the moment, only one argument can be passed to the MPI executable (or the application). Enclose the binary and its arguments in quotes. For example (2 Haswell-EP nodes):

likwid-mpirun -np 56 -nperdomain S:14 -g ENERGY "./myexecutable 10"

This command pins 14 tasks per socket (56 tasks require 4 sockets, i.e. 2 nodes).
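
The value for -np follows from the node layout: nodes times sockets per node times tasks per socket. The sketch below only assembles and prints the command line; the variable names are illustrative, and two sockets per node is an assumption taken from the examples on this page.

```shell
#!/bin/bash
# Derive the -np value for likwid-mpirun from the node layout.
# All variable names are illustrative; the command is printed, not run.
nodes=2              # number of nodes in the job
sockets_per_node=2   # assumption: two-socket nodes
tasks_per_socket=14  # value passed to -nperdomain S:<n>

np=$(( nodes * sockets_per_node * tasks_per_socket ))

# The binary and its single argument are quoted together
cmd="likwid-mpirun -np ${np} -nperdomain S:${tasks_per_socket} -g ENERGY \"./myexecutable 10\""
echo "${cmd}"
```

With tasks_per_socket=8 and the group CLOCK, the same calculation gives the 32 tasks used in the Sandy Bridge-EP example.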

Further information