Deep Learning System Nvidia DGX-1 and OpenStack GPU VMs


The Deep Learning System DGX-1 is a “supercomputer in a box” with a peak performance of 170 TFlop/s (FP16). It contains eight high-end NVIDIA GPGPUs (Tesla P100), each with 16 GB of memory and 3,584 CUDA cores (28,672 in total), connected to each other by an NVLink interconnect and to an x86-compatible host system with 40 cores (Intel Xeon). Users can reserve the whole DGX-1 exclusively and run complex machine-learning tasks, which are available via Docker images.

The second generation of the DGX-1 comes with eight V100 GPUs; for details we refer to the NVIDIA documentation.

One very important aspect of using the DGX-1 is taking advantage of mixed-precision training: all deep-learning jobs running on the DGX-1 should implement mixed-precision training. You can find the related documentation here:
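The core idea of mixed-precision training can be sketched as follows: compute in FP16, keep an FP32 "master" copy of the weights, and scale the loss so that small gradients do not underflow in FP16. The sketch below illustrates this with NumPy on a toy linear model; the variable names and the training loop are illustrative only, not the API of any particular framework.

```python
import numpy as np

# Conceptual sketch of mixed-precision training on a toy linear model.
# Three ingredients: FP16 compute, an FP32 "master" copy of the weights,
# and loss scaling so that small FP16 gradients do not underflow.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 10)).astype(np.float16)  # inputs in FP16
y = rng.standard_normal((64, 1)).astype(np.float16)   # targets in FP16

master_w = rng.standard_normal((10, 1)).astype(np.float32)  # FP32 master weights
lr, loss_scale = 1e-2, 128.0

def mse(w32):
    pred = x.astype(np.float32) @ w32
    return float(np.mean((pred - y.astype(np.float32)) ** 2))

initial_loss = mse(master_w)
for step in range(100):
    w16 = master_w.astype(np.float16)      # FP16 copy for the forward pass
    err = x @ w16 - y                      # forward pass and error, in FP16
    # backward pass on the *scaled* loss, still in FP16
    grad16 = x.T @ (err * np.float16(2 * loss_scale / len(x)))
    # unscale in FP32 and apply the update to the master weights
    master_w -= lr * (grad16.astype(np.float32) / loss_scale)
final_loss = mse(master_w)                 # final_loss < initial_loss
```

In a real framework the same three steps (FP16 cast, scaled backward pass, FP32 master update) are handled by the framework's mixed-precision utilities rather than by hand.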

LRZ has organized an HPC4AI workshop; videos of the workshop can be found here:
Users who want to reserve the DGX-1 for longer periods have to state in their application that they have implemented mixed-precision training.

A set of preinstalled images covers deep learning toolkits such as TensorFlow, Theano, CNTK, Torch, DIGITS, Caffe and others.

The system is running Ubuntu 14.04 LTS in the version supported by NVIDIA.

Below is a schematic drawing of the internals of the system. The 8 GPUs are connected via the NVLink high-speed interconnect, and the x86 cores are connected to the GPUs via PCIe switches. For detailed documentation of the hardware, please see the NVIDIA website directly.

Also available are 4 independent single-node systems with one NVIDIA P100 GPU for development purposes. On these systems only a general-purpose image is available, which provides the PGI Compiler Suite and CUDA.


Access and Login

LRZ Linux Cluster users can get access to the system by submitting an incident ticket (subject: Linux Cluster) to the LRZ Service Desk. (Information on how to get an LRZ Linux Cluster account is available here; please note: your TUM/LMU or LRZ account is not an LRZ Linux Cluster account!) Please briefly describe your research in your application (why you need a GPU). Note that the DGX-1 is in high demand and is a production machine, meaning it should not be used as a debugging tool; any software you want to run on the DGX-1 should already have been tested on a GPU elsewhere. Currently all LRZ GPU systems are in a testing phase and there is no backup of data: you have to back up your data yourself.

The system can be reserved via an online calendar system, which is at the moment only reachable from within the MWN. If you want to use the reservation system or log in to the compute system from outside the MWN, you first have to connect to the LRZ VPN (see the VPN documentation).

In the online calendar the user can see the available time slots and book the complete system for a maximum of 6 hours per slot. Longer reservations are possible if you can justify your requirement in an application (submit a ticket). We are working on a Slurm solution so that GPU users can submit batch jobs without a reservation.

Please remember that on the DGX-1 only /home/ is kept between sessions; on the VMs **everything** is lost. We are working on this, but in the meantime always copy your data.


The user then has to upload an SSH key into the online calendar, which will be used for authentication on the system.

When the date of the reservation approaches, the user will receive an email with further instructions on how to connect to the system via SSH and via an HTTP link. Please be aware that the user's computer has to be in the Munich Science Network (MWN) or connected to the LRZ VPN in order to reach the system.

Here is a GitHub document on how to use the LRZ deep learning systems, courtesy of Stefan Schweter.

NVIDIA GPU Optimized Deep Learning Frameworks 

The NVIDIA Deep Learning SDK accelerates widely-used deep learning frameworks.

This release provides containerized versions of those frameworks optimized for the NVIDIA DGX-1: pre-built, tested, and ready to run, including all necessary dependencies. Together with the NVIDIA Deep Learning Institute, LRZ regularly offers on-site deep learning workshops; please check the LRZ workshop/course page.

The frameworks below are listed with their base version and the name/release of the corresponding container.

Caffe (base version: NVIDIA Caffe 0.16.4; container: Caffe 17.10)

Caffe was originally developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. It is a deep learning framework made with expression, speed, and modularity in mind.

NVIDIA Caffe is an NVIDIA-maintained fork of BVLC Caffe tuned for NVIDIA GPUs, particularly in multi-GPU configurations, accelerated by the NVIDIA Deep Learning SDK. It includes multi-precision support as well as other NVIDIA-enhanced features and offers performance specially tuned for the NVIDIA DGX-1.

NVIDIA's manual for this version is available here.

Caffe2 (base version: Caffe2 0.8.1; container: Caffe2 17.10)

Caffe2 is a deep-learning framework designed to easily express all model types, for example CNNs, RNNs, and more, in a friendly Python-based API, and to execute them using a highly efficient C++ and CUDA back-end.

It gives the user a large amount of flexibility in assembling a model, whether for inference or training, from combinations of high-level and expressive operations. The model can then be run through the same Python interface, which allows easy visualization, or serialized and used directly through the underlying C++ implementation.

Caffe2 supports single and multi-GPU execution, along with support for multi-node execution.

CNTK (base version: CNTK 2.2; container: CNTK 17.10)

The Microsoft Cognitive Toolkit (CNTK) empowers you to harness the intelligence within massive datasets through deep learning by providing uncompromised scaling, speed and accuracy with commercial-grade quality and compatibility with the programming languages and algorithms you already use.
DIGITS (base version: DIGITS 6.0.0rc2 with Caffe 0.16.4; container: DIGITS 17.10)

DIGITS can be used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation and object detection tasks. DIGITS simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best-performing model from the results browser for deployment. DIGITS is completely interactive, so that data scientists can focus on designing and training networks rather than programming and debugging.
MXNet (base version: MXNet 0.11.0; container: MXNet 17.10)

MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix the flavors of symbolic programming and imperative programming to maximize efficiency and productivity. 

At its core is a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph-optimization layer on top of that makes symbolic execution fast and memory-efficient. The library is portable and lightweight, and it scales to multiple GPUs and multiple machines.

MXNet is more than a deep learning project, however. It is also a collection of blueprints and guidelines for building deep learning systems, and offers interesting insights into DL systems for hackers.

PyTorch (base version: PyTorch v0.2; container: PyTorch 17.10)

PyTorch is a Python package that provides two high-level features:
  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
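The "tape-based autograd" idea can be illustrated in a few lines of plain Python: each operation records on a tape how to propagate gradients backwards, and backward() replays the tape in reverse. This is a conceptual sketch only, not PyTorch's actual implementation; the Var class and its methods are invented for illustration.

```python
# Minimal sketch of tape-based reverse-mode autodiff, the idea behind
# PyTorch's autograd (conceptual only, not PyTorch's real internals).
# Each operation appends a closure to a shared tape; backward() replays
# the tape in reverse to accumulate gradients.

class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def _op(self, other, value, dself, dother):
        other = other if isinstance(other, Var) else Var(other, self.tape)
        out = Var(value, self.tape)
        def backprop():  # records how this op passes gradients backwards
            self.grad += dself * out.grad
            other.grad += dother * out.grad
        self.tape.append(backprop)
        return out

    def __add__(self, other):
        o = other.value if isinstance(other, Var) else other
        return self._op(other, self.value + o, 1.0, 1.0)

    def __mul__(self, other):
        o = other.value if isinstance(other, Var) else other
        return self._op(other, self.value * o, o, self.value)

    def backward(self):
        self.grad = 1.0
        for backprop in reversed(self.tape):
            backprop()

# f(x, y) = x*y + x  ->  df/dx = y + 1, df/dy = x
tape = []
x, y = Var(3.0, tape), Var(4.0, tape)
f = x * y + x
f.backward()
# x.grad == 5.0, y.grad == 3.0
```

PyTorch's real autograd does the same thing over tensor operations, building the tape dynamically as the Python code runs.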
TensorFlow (base version: TensorFlow 1.3.0; container: Tensorflow 17.10)

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code. 

TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research. The system is general enough to be applicable in a wide variety of other domains, as well.
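The dataflow-graph model described above can be sketched in plain Python: the graph is built first (nodes for operations, edges for the tensors flowing between them) and only evaluated afterwards, which is what lets such a system place subgraphs on different devices. The Node class and helper functions below are hypothetical illustrations, not the TensorFlow API.

```python
# Conceptual sketch of a dataflow graph (not the TensorFlow API):
# nodes are operations, edges carry the values flowing between them.
# The graph is constructed first and evaluated afterwards, with each
# node computed once and cached.

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

def constant(v):
    return Node("const", value=v)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

def evaluate(node, cache=None):
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        args = [evaluate(n, cache) for n in node.inputs]
        result = args[0] + args[1] if node.op == "add" else args[0] * args[1]
    cache[id(node)] = result
    return result

# graph for (2 + 3) * 4: built first, evaluated afterwards
g = mul(add(constant(2.0), constant(3.0)), constant(4.0))
result = evaluate(g)   # 20.0
```

Separating graph construction from evaluation is what allows a framework to optimize the graph and assign its nodes to CPUs or GPUs before any computation runs.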

Theano (base version: Theano 0.10beta3; container: Theano 17.10)

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
Torch (base version: Torch7; container: Torch 17.10)

Torch is a scientific computing framework with wide support for deep learning algorithms. Thanks to an easy and fast scripting language, Lua, and an underlying C/CUDA implementation, Torch is easy to use and is efficient. 

Torch offers popular neural network and optimization libraries that are easy to use yet provide maximum flexibility to build complex neural network topologies.


General purpose container

CUDA and OpenACC compiler ("CUDA 9 and PGI 17.10")

This is a basic container for software development containing the following components:

  • Ubuntu 16.04
  • NVIDIA CUDA® 9.0.176
  • NVIDIA cuDNN 7.0.3
  • NVIDIA NCCL 2.0.5 (optimized for NVLink)
  • PGI C++ and Fortran Compiler 17.4
  • OpenMPI with GPUDirect

Using nvidia-docker on single GPU virtual servers

As the single-GPU virtual servers are deleted completely after each session, a quick way to install your software is helpful. The fastest way to run a machine-learning application on the single-GPU systems is to use nvidia-docker.

Example: TensorFlow

Log in to the virtual server using:
ssh -L 8888:localhost:8888 ubuntu@<your ip address from setup email>

Pull and start the container:

nvidia-docker run -it -p 8888:8888 <tensorflow container image>

Then open http://localhost:8888 in your browser.

You can use the option "-v /ssdtemp:/ssdtemp" to map the 800 GB of storage space mounted on /ssdtemp in the server into the container.
A list of TensorFlow containers is available here:
You can of course package your own applications using nvidia-docker; see the nvidia-docker documentation.


11. January 2018

Single GPUs updated to CUDA 9.1 and PGI 17.9

2. November 2017

Updated all containers to 17.10; old containers for MXNet, CUDA, TensorFlow and Caffe are still available. 17.10 is basically only a minor update to CUDA, cuDNN and NCCL and includes the latest OS updates.