SuperMIC - Intel Xeon Phi Cluster

Account on SuperMIC

Access to SuperMIC is only granted to selected users, who must already have an active SuperMUC account. If you are interested in accessing the SuperMIC system, please submit a trouble ticket via the LRZ-Servicedesk.

If you do not have a SuperMUC account yet, you first have to submit a new project proposal. Test accounts (with access to SuperMUC and SuperMIC) are also available. See Project-proposal.

Current Software Stack

MPSS 3.4; MLNX_OFED_LINUX-2.2 (OFED-2.2-1.0.0); Intel Parallel Studio XE 2016: compiler, MPI, MKL and Amplifier modules

Intel Many Integrated Core (MIC) Architecture

The Intel Xeon Phi coprocessor is a descendant of the so-called “Larrabee” chip, which never became a product. Larrabee mainly targeted the computer graphics market and could only be programmed with great effort under Windows. In contrast, the Intel Xeon Phi primarily targets the HPC market and can be programmed using standard parallel programming techniques under Linux. Compared with GPGPUs, one of the main advantages of the coprocessor is that the programmer can log in directly on the card from the host using TCP/IP based tools like ssh. This allows the user to watch and control processes running on the coprocessor with tools like top and kill, and to benefit from all the useful information available via the Linux /proc filesystem. Due to the large SIMD register width (512 bit), efficient vectorisation of the code is very important on the Intel Xeon Phi.

An architectural overview is given in the following figure.


A bidirectional ring interconnect connects all the cores, L2 caches and other components like the tag directories (TD), the PCIe client logic or the GDDR5 memory controllers. Details about our currently installed Intel Xeon Phi cards are summarised in the following table.

Number of cores                        60
Frequency of cores                     1.1 GHz
GDDR5 memory size                      8 GB
Number of hardware threads per core    4
SIMD vector registers                  32 (512-bit wide) per thread context
Floating-point ops per cycle per core  16 (DP), 32 (SP)
Theoretical peak performance           1 TFlop/s (DP), 2 TFlop/s (SP)
L2 cache per core                      512 kB
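The quoted DP peak follows directly from the core count, the clock and the (assumed) 16 DP floating-point operations per cycle and core; a quick sanity check, using awk only as a calculator:

```shell
# sanity check of the quoted DP peak: cores x clock (GHz) x DP flops per cycle
awk 'BEGIN { printf "%.0f GFlop/s\n", 60 * 1.1 * 16 }'   # prints 1056 GFlop/s
```

This matches the 1 TFlop/s (DP) figure; doubling the flops per cycle for SP yields the 2 TFlop/s figure.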

System Configuration

The cluster consists of one iDataPlex rack with 32 nodes dx360 M4. Each node contains 2 Ivy-Bridge (2 x 8 cores) host processors E5-2650 @ 2.6 GHz and 2 Intel Xeon Phi (MIC) coprocessors 5110P with 60 cores @ 1.1 GHz.

Memory per node is 64 GB (host) and 2 x 8 GB (Intel Xeon Phi).

The following picture shows the iDataPlex rack of SuperMIC.


Photo: Vasileios Karakasis

Intel's Xeon Phi coprocessor is based on x86 technology and consists of 60 cores interconnected by a bidirectional ring. Each core can run up to 4 hardware threads, so a maximum of 240 hardware threads is supported per coprocessor.

The connection from the host to the attached coprocessors is via PCIe 2.0, which limits the bandwidth to 6.2 GB/s.

The compute nodes are connected via Mellanox Infiniband FDR14 using Mellanox OFED 2.2. Through a virtual bridge interface all MIC coprocessors can be directly accessed from the SuperMIC login node and the compute nodes.

The SuperMIC login node (=login12) can be accessed from the supermuc login nodes and user-registered IP addresses.

The Intel Xeon compute nodes in rack column A are named i01r13a01, i01r13a02, ..., i01r13a16, those in rack column B are named i01r13c01, i01r13c02, ..., i01r13c16.

The 2 MIC coprocessors attached to a compute node using PCIe are named (e.g. for the host i01r13a01) i01r13a01-mic0 and i01r13a01-mic1.

2 MLNX CX3 FDR PCIe cards are attached per node, one to each CPU socket (via riser cards).

Linux runs as the operating system on both the compute hosts and the Intel Xeon Phi coprocessors.

Every compute node has a local disk attached to it.

An overview of the various nodes and coprocessors is shown in the following picture.



SuperMIC can only be accessed by users who already have a SuperMUC user account with a registered IP address. Please apply for access by contacting the LRZ HPC support, providing your name, user ID and your static IP address. After you are registered, use ssh to connect to the SuperMIC login node via:


The SuperMIC login node (supermic = login12) should only be used for editing, compiling and submitting programs. Compilation is also possible on the SuperMUC login nodes.

Jobs must be submitted via LoadLeveler. Once you get the reservation from the batch system, interactive login to the reserved compute nodes and the attached MIC coprocessors is also possible.

Mind that compilation on the compute nodes and coprocessors is not possible (no include files installed etc.).

Login to compute nodes:

ssh i01r13???

Login to the corresponding MIC coprocessors:

ssh i01r13???-mic0
ssh i01r13???-mic1

To be able to log in to the coprocessors you have to create ssh keys using the command "ssh-keygen". These keys must be named id_rsa (private key) and id_rsa.pub (public key) and must reside within your $HOME/.ssh directory _before_ you submit a SuperMIC job.

The content of id_rsa.pub has to be appended to the file $HOME/.ssh/authorized_keys to also access the compute nodes.

It is then possible to login to the MIC coprocessors from the reserved compute nodes or directly from the SuperMIC login node, too.
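The key setup described above can be scripted. The sketch below deliberately uses a throw-away directory instead of $HOME/.ssh so it cannot clobber existing keys; on SuperMIC you would substitute $HOME/.ssh:

```shell
# one-time ssh key setup, sketched in a scratch directory (use $HOME/.ssh on SuperMIC)
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"        # creates id_rsa and id_rsa.pub
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys" # grant this key login access
chmod 600 "$KEYDIR/authorized_keys"
ls "$KEYDIR"
```

Note the empty passphrase (-N ""): MPI task startup via ssh (see below) requires passwordless keys.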

Information on how to login to SuperMUC for LRZ trainings and courses can be found here.

Budget and Quota

There are no quotas for computing time on the SuperMIC cluster yet.


The user home directories under /home/hpc/group/user and the LRZ software repositories under /lrz/sys/ (same as on SuperMUC) and /lrz/mic/ (MIC specific software and modules) are mounted on the SuperMIC login-node, compute nodes and the MIC coprocessors.

To offer a clean environment on the MICs, the default home-directory on the MICs is a local directory /home/user.

The local disks have 2 partitions:

/scratch (410 GB) for user scratch files (read + write). Content is cleaned after the job finishes.

/lrzmic (50 GB) for MIC specific software (read-only), maintained by LRZ. Deprecated: MIC-specific software is now mounted globally under /lrz/mic.

Both partitions are mounted on the compute host and the 2 attached Intel Xeon Phi coprocessors.

Mind that the I/O performance of the mounted /home/hpc filesystem is quite low. For MPI runs, the MPI executable and necessary input files are therefore best copied explicitly to the scratch partitions of the local disks attached to the allocated compute nodes.
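Staging to the node-local disks can be scripted inside a job script. The sketch below only echoes the scp commands it would issue; the node names and file names are placeholders, and in a real job the node list would be derived from $LOADL_HOSTFILE:

```shell
# hedged sketch: stage a MIC binary plus an input file onto each node's local /scratch
# placeholder node list; in a real job derive it from $LOADL_HOSTFILE
NODES="i01r13a01 i01r13a02"
for node in $NODES; do
    # 'echo' only prints the command here; drop it inside a real job
    echo scp prog.mic input.dat "$node:/scratch/"
done
```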

Interactive and Batch Jobs

Batch Mode

Batch jobs must use the LoadLeveler class "phi". This is currently the only available class; its runtime limit is 48 h. In the batch file the number of compute nodes (i.e. the number of hosts, not the number of Intel Xeon Phi coprocessors) must be specified.

To offer a clean system environment the Xeon Phi coprocessors are reset and rebooted after every job.

Example script: sample job file allocating 1 compute node and the 2 Intel Xeon Phi coprocessors attached to that host:

The following batch scripts assume that the initialdir $HOME/jobs/ already exists.

#@ wall_clock_limit = 01:00:00
#@ job_name =test
#@ job_type = parallel
#@ class = phi
#@ node = 1
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).out
#@ notification=always
#@ queue

. /etc/profile
. /etc/profile.d/

program-doing-offloading #for noninteractive execution
sleep 10000 #for interactive access while the job is running

Submitting batch jobs:

lu65fok@login12:~/jobs> llsubmit  job1.ll
llsubmit: Processed command file through Submit Filter: "/lrz/loadl/SuperMIC/filter/".
llsubmit: The job "i01xcat3-ib.287" has been submitted.

Showing the queue:

lu65fok@login12:~/jobs> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
i01xcat3-ib.287.0        lu65fok     6/4  16:51 ST 50  phi          i01r13c11-ib

1 job step(s) in queue, 0 waiting, 1 pending, 0 running, 0 held, 0 preempted

Cancelling a job:

lu65fok@login12:~/jobs> llcancel i01xcat3-ib.288.0
llcancel: Cancel command has been sent to the central manager.

Due to the rebooting of the cards, cancellation of a job takes some time.

Interactive Jobs

To work interactively with the system, submit a simple batch file with a "sleep 10000" command at the very end of the script.

During the sleeping period the allocated compute-nodes and Intel Xeon Phi coprocessors can be accessed interactively.

Programming Models

The main advantage of the MIC architecture is the possibility to program the coprocessor using plain C, C++ or Fortran and standard parallelisation models like OpenMP, MPI and hybrid OpenMP & MPI. The coprocessor can also be programmed using Intel Cilk Plus, Intel Threading Building Blocks, pthreads and OpenCL. Standard math libraries like Intel MKL are supported, and last but not least the whole Intel tool chain is available, e.g. the Intel C/C++ and Fortran compilers, the debugger and Intel VTune Amplifier. Not all methods mentioned are installed on SuperMIC yet. It is also possible to do hardware-specific tuning using intrinsics or assembler. However, we would not recommend this (except maybe for some critical kernel routines), since MIC vector intrinsics and assembler instructions are incompatible with SSE or AVX instructions.

An overview of the available programming models is shown in the following picture.


Generally speaking, two main execution modes can be distinguished: native mode and offload mode. In “native mode” the Intel compiler is instructed (via the compiler switch -mmic) to cross-compile for the MIC architecture. This is also possible for OpenMP and MPI codes. The generated executable has to be copied to the coprocessor and can be launched from within a shell running on the coprocessor.

In “offload mode” the code is instrumented with OpenMP-like pragmas in C/C++ or comments in Fortran to mark regions of code that should be offloaded to the coprocessor and be executed there at runtime. The code in the marked regions can be multithreaded by using e.g. OpenMP. The generated executable must be launched from the host. This approach is quite similar to the accelerator pragmas introduced by the PGI compiler, CAPS HMPP or OpenACC to offload code to GPGPUs.

For MPI programs, MPI ranks can reside on only the coprocessor(s), on only the host(s) (possibly doing offloading), or on both the host(s) and the coprocessor(s) allowing various combinations in clusters.

The following picture demonstrates the spectrum of execution modes supported, ranging from Xeon-centric over symmetric to MIC-centric modes.


Intel Offload

The Intel compilers offer a set of pragmas for offloading code to the Intel Xeon Phi coprocessor. No compiler option is needed to compile code containing offload pragmas. The following table shows an overview of the supported offload pragma clauses:




Feature                  Clause                           Semantics
Multiple coprocessors    target(mic[:unit])               Select specific coprocessors
Conditional offload      if (condition) / mandatory       Select coprocessor or host compute
Inputs                   in(var-list modifiers_opt)       Copy from host to coprocessor
Outputs                  out(var-list modifiers_opt)      Copy from coprocessor to host
Inputs & outputs         inout(var-list modifiers_opt)    Copy host to coprocessor and back when offload completes
Non-copied data          nocopy(var-list modifiers_opt)   Data is local to target
Asynchronous offload     signal (signal-slot)             Trigger asynchronous offload
Asynchronous offload     wait (signal-slot)               Wait for completion

The following options are supported:




Option                                           Modifier                                 Semantics
Specify copy length                              length(N)                                Copy N elements of the pointer's type
Coprocessor memory allocation                    alloc_if (bool)                          Allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release                       free_if (bool)                           Free coprocessor space at the end of this offload (default: TRUE)
Array partial allocation & variable relocation   alloc (array-slice) / into (var-expr)    Enables partial array allocation and data copy into other vars & ranges
The offload pragma only offloads the code to the coprocessor; it does not parallelise it. OpenMP can be used to parallelise the code as follows.

#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
    #pragma omp parallel for private(k,j)
    for( i = 0; i < n; i++ ) {
        for( k = 0; k < n; k++ ) {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) {
                /* flattened 2D indexing: c[i][j] += a[i][k]*b[k][j] */
                c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
        }
    }
}

Compiling Programs

Compiling OpenMP Programs

The application can use OpenMP to take advantage of the shared-memory parallelism across the many cores of the coprocessor.

To compile OpenMP programs use icc (C/C++) or ifort (Fortran) with the -openmp option as follows:

icc -openmp -xhost -O2 -o prog prog.c
icc -openmp -mmic -O2 -o prog.mic prog.c

You can use the -openmp_report option to display diagnostic information.

Compiling MPI Programs

To compile MPI programs for Xeon and Xeon Phi use mpiicc / mpiifort, e.g.

mpiicc -xhost -O2 -o prog prog.c
mpiicc -mmic -O2 -o prog.mic prog.c

Mind that mpicc / mpifort (i.e. one "i" missing in the command name) do not support the compilation for MIC.

Running Applications

Offload Mode

Offload regions are automatically offloaded to the attached Intel Xeon Phi coprocessors.

Use

export H_TRACE=1

to get offloading diagnostics at runtime.

To get timings use:

export H_TIME=1
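Both variables can be exported together in a job script or interactive shell before launching the offloading program; a minimal check that they are set in the environment:

```shell
# enable offload diagnostics and timing (variables taken from this page)
export H_TRACE=1
export H_TIME=1
env | grep '^H_T' | sort    # prints H_TIME=1 and H_TRACE=1
```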

Native Mode

Run an interactive job (using sleep within a job script) and copy the binary compiled for MIC to the coprocessor using scp or copy it to the local filesystem /scratch/. Login to the Intel Xeon Phi coprocessor to get a shell and execute the binary.

For simple programs not needing any input files etc., native programs can also be launched from the host using

micnativeloadex micbinary

MPI Programs

MPI tasks can be run on the host and/or on the Xeon Phi coprocessors. The MPI code can contain Offload Pragmas for offloading to the coprocessors and/or OpenMP to implement multithreading.

Only Intel MPI is supported. Since Intel MPI tasks are launched via ssh on remote nodes / coprocessors (default on SuperMIC: I_MPI_HYDRA_BOOTSTRAP=ssh), users have to create passwordless ssh keys named id_rsa using "ssh-keygen". The content of id_rsa.pub has to be appended to the file $HOME/.ssh/authorized_keys. It is not allowed to copy the keys to other systems outside of SuperMUC/SuperMIC.

An overview of the various MPI/OpenMP modes is presented in the following picture.


MPI Programs: Offload Mode

Sample MPI program doing offloading:

#include <unistd.h>
#include <stdio.h>
#include <mpi.h>

int main (int argc, char* argv[]) {
  char hostname[100];
  int rank, size;
  MPI_Init (&argc, &argv);                      /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);        /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);        /* get number of processes */
  gethostname (hostname, sizeof(hostname));     /* name of the host running this rank */

  #pragma offload target(mic:rank%2)            /* even ranks use mic0, odd ranks mic1 */
  {
    char michostname[50];
    gethostname(michostname, sizeof(michostname));
    printf("MIC: I am %s and I have %ld logical cores. I was called by process %d of %d: host: %s\n",
           michostname, sysconf(_SC_NPROCESSORS_ONLN), rank, size, hostname);
  }
  printf("Hello world from process %d of %d: host: %s\n", rank, size, hostname);
  MPI_Finalize();
  return 0;
}

Sample job script allocating 2 nodes and running 2 MPI processes on each compute node. Each MPI process offloads to a different MIC coprocessor.

#@ wall_clock_limit = 01:00:00
#@ job_name =test
#@ job_type = parallel
#@ class = phi
#@ node = 2
#@ tasks_per_node= 2
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).err
#@ notification=always
#@ queue

. /etc/profile
. /etc/profile.d/

export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u
mpiexec ./testmpioffload


Sample output:

Hello world from process 2 of 4: host: i01r13a06
Hello world from process 1 of 4: host: i01r13a07
Hello world from process 3 of 4: host: i01r13a06
Hello world from process 0 of 4: host: i01r13a07
MIC: I am i01r13a07-mic1 and I have 240 logical cores. I was called by  process 1 of 4: host: i01r13a07
MIC: I am i01r13a06-mic1 and I have 240 logical cores. I was called by  process 3 of 4: host: i01r13a06
MIC: I am i01r13a07-mic0 and I have 240 logical cores. I was called by  process 0 of 4: host: i01r13a07
MIC: I am i01r13a06-mic0 and I have 240 logical cores. I was called by  process 2 of 4: host: i01r13a06

MPI Programs: Native Mode

To run MPI tasks natively on the Intel Xeon Phi coprocessors, I_MPI_MIC must be enabled (default on SuperMIC: I_MPI_MIC=enable). The preferred fabric is Infiniband via dapl (default on SuperMIC: I_MPI_FABRICS=shm:dapl). The preferred value of I_MPI_DAPL_PROVIDER_LIST depends on the MPI mode used (tasks on host and MIC vs. host and host etc.). Recommendations will be provided here in the near future. Currently all MPI modes work with the setting I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u, but may not deliver maximum bandwidth.

In some cases I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0 delivers better performance.
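Putting the defaults and recommendations above together, the environment section of a job script for a native host+MIC run would typically look as follows (all values are quoted from this page; only the provider list normally needs adjusting):

```shell
# environment for a native host+MIC MPI run (defaults quoted on this page)
export I_MPI_MIC=enable                            # allow MPI ranks on the coprocessors
export I_MPI_FABRICS=shm:dapl                      # Infiniband via dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u   # works for all modes, maybe not peak bandwidth
env | grep '^I_MPI_' | sort                        # verify the settings
```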

Below we show a sample job script allocating two nodes and running 1 MPI process per host and 1 per MIC coprocessor.

The binary $bin_mic must have been compiled natively for MIC using "-mmic". The script copies it to the local /scratch/ directory on every allocated compute node.

The binary $bin_host must have been compiled for Intel Xeon.

$taskspermic and $tasksperhost should be set appropriately (in this example both are set to 1).

#@ wall_clock_limit = 01:00:00
#@ job_name =TEST2
#@ job_type = parallel
#@ class = phi
#@ node = 2
#@ node_usage = not_shared
#@ initialdir = $(home)/jobs/
#@ output = test-$(jobid).out
#@ error = test-$(jobid).err
#@ notification=always
#@ queue

. /etc/profile
. /etc/profile.d/

export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u

# variable setup (values used in this example; adapt as needed)
bin_host=./testmpi-host
bin_mic=testmpi-mic
dir_mic=/scratch
tasksperhost=1
taskspermic=1
arg=""
numhosts=0
command="mpiexec "

for i in `cat $LOADL_HOSTFILE`; do
    host=`echo $i | cut -d- -f1`
    hosts="$hosts $host"
    mic0=$host-mic0
    mic1=$host-mic1
    mics="$mics $mic0 $mic1"
    scp $bin_mic $host:/$dir_mic

    command="$command -host $host-ib -n $tasksperhost $bin_host $arg : "
    command="$command -host $mic0 -n $taskspermic $dir_mic/$bin_mic $arg : -host $mic1 -n $taskspermic $dir_mic/$bin_mic $arg"

    numhosts=`expr $numhosts + 1`
    if test $numhosts -lt $LOADL_TOTAL_TASKS; then
        command="$command : "
    fi
done

echo Hosts = $hosts
echo MICS = $mics
echo $command
$command


Sample output:

Hosts = i01r13c16 i01r13a02
MICS = i01r13c16-mic0 i01r13c16-mic1 i01r13a02-mic0 i01r13a02-mic1

mpiexec -host i01r13c16-ib -n 1 ./testmpi-host : 
        -host i01r13c16-mic0 -n 1 /scratch/testmpi-mic : 
        -host i01r13c16-mic1 -n 1 /scratch/testmpi-mic : 
        -host i01r13a02-ib -n 1 ./testmpi-host : 
        -host i01r13a02-mic0 -n 1 /scratch/testmpi-mic : 
        -host i01r13a02-mic1 -n 1 /scratch/testmpi-mic

Hello world from process 3 of 6: host: i01r13a02
Hello world from process 0 of 6: host: i01r13c16
Hello world from process 5 of 6: host: i01r13a02-mic1
Hello world from process 4 of 6: host: i01r13a02-mic0
Hello world from process 2 of 6: host: i01r13c16-mic1
Hello world from process 1 of 6: host: i01r13c16-mic0

MPI Usage in Native Mode: Best Practices

When running MPI tasks on several hosts AND Xeon Phi coprocessors, several collective MPI functions like MPI_Barrier may not return properly (they cause deadlocks).

In this case set, e.g.,

export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u

More details can be found under

To improve the performance of MPI_Put operations use:


Make sure the following variables are not set on SuperMIC (they are currently set by the intel.mpi module for all systems except SuperMIC):

  unset I_MPI_HYDRA_BOOTSTRAP_EXEC (makes MPI jobs block at start-up)

  unset I_MPI_HYDRA_BRANCH_COUNT (causes strange ssh authorization errors)

More details can be found under 

MKL Usage in Offload Mode

Some Intel MKL routines include Automatic Offload (AO) extensions that enable computationally intensive MKL functions called in user code to automatically benefit from the attached Intel Xeon Phi coprocessors. Computations can either be offloaded automatically or via Compiler Assisted Offload.

Compiler Assisted Offload

MKL functions can be inserted inside offload regions to run on the Intel Xeon Phi coprocessor when present. The programmer maintains explicit control of data transfers and remote execution using compiler offload pragmas and directives. Directives control data movement. Example:

#pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(A:length(matrix_elements)) \
            in(B:length(matrix_elements)) \
            out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Automatic Offload (AO)

The user code can be compiled without any modifications using the Intel compiler; no code change is required to offload calculations to an Intel Xeon Phi coprocessor. Automatic Offload automatically uses both the host and the Intel Xeon Phi coprocessor. It is enabled either by calling the function mkl_mic_enable() or by setting the following environment variable:

export MKL_MIC_ENABLE=1
The MKL library then takes care of data transfers and execution management.

The code is then compiled as usual with the Intel compiler:

icc -mkl -o prog prog.c

The work can be divided between the host and the coprocessors by using the function mkl_mic_set_workdivision(TARGET_TYPE, TARGET_NUMBER, WORK_RATIO). It specifies how much work each Intel Xeon Phi coprocessor or the host CPU should do. WORK_RATIO sets the fractional amount of work that the specified device should do, with 0.0 ≤ WORK_RATIO ≤ 1.0 (see the Intel MKL User's Guide for details under Further Documentation).

To see more information about Automatic Offload and for debugging purposes, set the environment variable:

export OFFLOAD_REPORT=2
This value can be set from 0 to 3, where a higher number means more debugging information.

Software Versions on the Coprocessors

A small module system is also installed on the coprocessors.

Currently the following modules are available:

u65fok@i01r13a12-mic0:~$ module av  
------------ /lrz/mic/share/modules/files/tools ------------ 
------------ /lrz/mic/share/modules/files/compilers ------------ 
ccomp/intel/14.0            fortran/intel/14.0          
ccomp/intel/15.0(default)   fortran/intel/15.0(default) 
------------ /lrz/mic/share/modules/files/libraries ------------ 
------------ /lrz/mic/share/modules/files/parallel ------------      

Profiling Tools


likwid (lightweight performance tools) can be loaded interactively on the coprocessors using

module load likwid

For more details see the likwid home page.

Intel Amplifier XE

The Sampling Enabling Product (SEP) kernel modules are now loaded on the compute nodes and MIC coprocessors by default. Only use the module amplifier_xe/2016 and no older versions, since with incompatible versions the sep kernel module will often hang.

Since a lot of output is produced, always use /scratch as the output directory. Write access to /home is very slow.

To collect data, use the command line version "amplxe-cl" on the COMPUTE NODES. To visualise the collected data, use the GUI version "amplxe" on the LOGIN NODE.

To detect hotspots on the MIC run something similar to the following:

 # module load amplifier_xe/2016 
 # rm -rf /scratch/result 
 # amplxe-cl --collect advanced-hotspots -target-system=mic-native:`hostname`-mic0  -r /scratch/result -- /home/hpc/pr28fa/lu65fok/mic/program-for-mic

To use SEP directly, use a command like (use /scratch as output directory!)

  # module load amplifier_xe/2016
  # sep -start -mic -verbose -out /scratch/test  -app /usr/bin/ssh -args
  "lu65fok@i01r13c01-mic0 /home/lu65fok/micapp"

to sample on the MIC. To convert tb6 files to the amplxe format and view them within Amplifier XE use (on the login node):

  # amplxe-cl -import test.tb6 -r test
  # amplxe-gui  test 

Limitations and Differences to SuperMUC

  • The SuperMUC filesystems $SCRATCH and $WORK are not mounted on SuperMIC.
  • Compilation is only possible on the SuperMIC and SuperMUC login nodes, not on the compute nodes or on the coprocessors.
  • Only Intel compilers and Intel MPI are supported on the SuperMIC system. 
  • The modules (or their versions) loaded per default on SuperMIC might be different from SuperMUC.
  • There is a small module system available on the coprocessors after interactive login.
  • In contrast to the diskless SuperMUC nodes, SuperMIC nodes have a local disk attached and a local /scratch partition shared between one host and its 2 attached coprocessors.
  • /home/hpc/ and /lrz/ filesystems are shared by all coprocessors. Mind that the default home-directory /home/user on the MICs is a local one.

Further Documentation


James Reinders, James Jeffers: Intel Xeon Phi Coprocessor High Performance Programming, Morgan Kaufmann, 2013.

Rezaur Rahman: Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress, 2013.

Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, Colfax, 2013.

Intel Developer Zone: Intel Xeon Phi Coprocessor.

Intel Many Integrated Core Architecture User Forum.

Intel Xeon Phi Coprocessor Developer's Quick Start Guide.

Best Practice Guide - Intel Xeon Phi (and references therein).

Intel Xeon Phi Coprocessor System Software Developers Guide.

Intel Math Kernel Library Link Line Advisor.

Intel Math Kernel Library 11.0 Reference Manual, Intel MKL User's Guide.