Train Language Models with SuperMUC-NG

Training language models on a supercomputer? That’s possible on SuperMUC-NG Phase 2: A few recommendations for training language models.

Even though most of them now rely on graphics processing units (GPUs), training language models on supercomputers is not a given. “Developers of frontier large language models (LLMs) naturally train on high-performance computing systems that are specifically built and optimised for such workloads,” explains Ajay Navilarekal Rajgopal, a computer scientist in the Big Data & AI team of the Leibniz Supercomputing Centre (LRZ). “Supercomputers used for research and science, by contrast, cover a much broader range of applications.” SuperMUC-NG Phase 2 (SNG-2), LRZ’s current flagship system, runs simulations across a wide range of scientific disciplines, solves complex equations — and can also be used to train language models. “Most GPT models have been trained in NVIDIA’s CUDA environment,” Rajgopal notes. “To implement them on SNG-2, which uses Intel accelerators, only minor software adjustments are required.”

Together with computational linguist Nikolai Solmsdorf, a software engineer at Intel, Rajgopal evaluated the training setup of GPT-style models focusing on GPU utilisation efficiency: “We are able to run the training of GPT-style models on nearly half of the system without significant performance losses. That is already a very good result.” The two researchers will present their findings at ISC.

Distributing Models and Data across the System

Generative pre-trained transformers (GPT) form the foundation of today’s large language models (LLM). These are built in layers of individual neural networks that collectively account for hundreds of billions of parameters, supplemented by transformer mechanisms—mathematical attention processes that mimic aspects of how the human brain handles language. These systems analyse massive amounts of text and words in order to generate words, sentences, and texts based on statistical probabilities.

Rajgopal and Solmsdorf tested the LLM training setup on SNG-2 with GPT-style models of varying sizes with 3.6 billion, 20 billion, or 175 billion parameters. For comparison, ChatGPT (version 5) is estimated to have between two and five trillion parameters. These scales illustrate that LLMs and their training data must be distributed and scaled across multiple GPUs and compute nodes.
Several strategies exist for doing so, each with different impacts on memory and network resources:

Tensor parallelism: Split individual weight matrices across GPUs; great for huge layers, low memory per device but very high communication overhead.
Pipeline parallelism: Split layers sequentially across the accelerators; moderate memory per device and low communication, but pipeline bubbles waste compute and latency spikes with more stages.
Data parallelism: Replicate the full model on more GPUs; highest memory per device but least communication overhead.

The three Dimensions of Parallelism

It's all in the mix: When training language models, the models and datasets can be distributed across the GPUs in various ways. This, in turn, affects either the computational power or the data transfer between the processing and storage units. The study explored these different approaches. Automated distribution proved to be the most efficient. Image generated with NotebookLM by © A. N. Rajgopal | LRZ

Automated GPU efficiency optimisation

“If layers for tensor parallelism have to be distributed across multiple compute nodes on SNG-2, some loss in GPU efficiency is to be expected,” Rajgopal explains. “With pipeline parallelism, this mainly occurs when training datasets are grouped into larger batches. It is more efficient in this approach to divide datasets into many smaller chunks.” The degree of distribution of model layers and datasets can be controlled manually with appropriate software on SNG-2. However, automating this process using hyperparameter optimisation tools proves more efficient — an approach also examined by the two AI specialists.

These practical insights into training language models on SNG-2 and the study about it are likely to interest researchers who have so far relied on NVIDIA or AMD resources and are looking to explore alternatives. Even linguists who develop their own models should be interested to read them. Whether they can be transferred to the training of other AI models is something Rajgopal and his colleagues at LRZ will test and analyse in the future. “We also plan,” the computer scientist adds, “to train a model end-to-end, to extend our study to even larger GPT models and additional distribution strategies, while also taking a closer look at energy consumption and various workflow aspects.” vs | LRZ

SNG-2: Setup for training

– 128 nodes with 512 Intel Data Centre GPU Max 1550s from SuperMUC-NG, Phase 2,
– the Megatron Deep Speed and Deep Speed-Zero libraries,
– PyTorch 2.8.0 and its Intel extension.

All experiments were carried out in the LRZ’s standard software environment and using software stacks without custom kernels or framework modifications.