LRZ develops 10B language model - Leibniz-Rechenzentrum

Bavarian, German and English in one large language model: to enable AI to understand Bavarian, the LRZ developed and trained its own language model.

Do you want to translate “Hock di hera, samma mera”* into German or English? Now there's Llama-GENBA-10B. The trilingual language model is based on Meta's large language model (LLM) Llama 3.1-8B. It has 10 billion parameters and was trained by researchers at LRZ and Cerebras Systems on a dataset of 164 billion tokens. Llama-GENBA-10B is an inclusive and resource-efficient base model that not only translates but also generates texts in English, German and Bavarian. “Our model demonstrates efficient multilingual training on the Cerebras CS-2 system,” explains Michael Hoffmann from the LRZ Big Data & Artificial Intelligence (BDAI) team. “To train Llama-GENBA-10B, the CS-2 system consumed around 35 megawatt hours of energy in 66 days.”

The group has just published a preprint on the methods and challenges of training Llama-GENBA-10B. It compares the model's performance with that of other language models, such as Apertus-8B, gemma-2-9b and EuroLLM-9B. “In addition to performance, it was important to us to work with non-English data and, above all, with a dialect,” says Jophin John from the LRZ BDAI team. Since most LLMs focus on English, Llama-GENBA-10B supports the preservation of less common languages and regional dialects. It thus provides a blueprint for similar models that even small research teams can implement.

* “Sit here, then there will be more of us.”