

Why does training AI models consume so much energy?
Prof. Felix Dietrich: The larger AI models are – as both research and practical experience confirm – the better they perform, but they also consume significantly more energy during training. Artificial neural networks (ANNs) are composed of interconnected elements, the neurons, which multiply input signals by predefined parameters, sum everything up, and then pass the result to all connected elements in the next layer of the network. ANNs are mainly trained iteratively – meaning the parameters are adjusted slightly, step by step, so that the network becomes increasingly better at predicting the desired outcome or generating a more meaningful forecast. These training processes consume a lot of energy because they require many individual steps or adjustments. The power consumption increases even more due to an additional loop: the search for the optimal network architecture, known as Neural Architecture Search (NAS) – in other words, finding the right configuration for the network. The complexity of this search grows with the size of the network and the complexity of the problem being solved. For instance, if the number of neurons or hidden layers is misconfigured, the network may be trained – and thus consume power – even though it ultimately doesn’t perform well after training.
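To make the description above concrete, here is a minimal sketch (in Python with NumPy, not code from the interview) of what such a network computes: each hidden neuron multiplies the inputs by its parameters, sums the results, and passes them on to the next layer. All names, sizes, and values are made up for illustration.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Minimal forward pass of a one-hidden-layer network.

    Each hidden neuron multiplies the inputs by its parameters (one row of W1),
    sums the products plus a bias, applies a nonlinearity, and passes the
    result on to the output layer.
    """
    hidden = np.tanh(x @ W1.T + b1)   # nonlinear hidden layer
    return hidden @ W2.T + b2         # linear output layer

# Hypothetical sizes: 2 inputs, 8 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)
print(forward(np.array([1.0, 2.0]), W1, b1, W2, b2))
```

Iterative training adjusts exactly these parameters (W1, b1, W2, b2) over many passes through the data, which is where the energy cost arises.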
Prof. Felix Dietrich researches numerical algorithms for AI methods. Dietrich studied Scientific Computing and completed his master's degree and doctorate at TUM. After working in the USA, he took over an Emmy Noether junior research group in 2022 and was appointed Professor of Physics-Enhanced Machine Learning by TUM in 2024.
How did you approach redesigning energy-intensive training steps?
Dietrich: We began by asking some fundamental questions: Why do neural networks work, and which parameters ultimately determine their predictions? The parameters of an AI model and their specific roles are still not fully understood. But if – that was our core idea – the role of each parameter were clearly known, then we could calculate the parameters directly from the data and potentially bypass both the search for the right configuration and thousands of training steps. We turned to probability theory because, currently, the parameters of a network are often initialized randomly before training – in principle, the training starts with a random guess. When several of these random predictions are then simply combined linearly, the result is known as a Random Feature Model. A comparable random approach – Randomized Linear Algebra – has proven useful in traditional scientific computing: in linear algebra, numbers are arranged in matrices, i.e. in rows and columns, but depending on the problem, some of these matrices are so large that computers cannot store them. In such cases, a trick is used: only the products of the matrix with randomly chosen vectors are stored, not the entire matrix. This became the foundation for our method.
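A small sketch of the randomized-linear-algebra trick mentioned here, assuming a matrix that is only ever accessed through matrix-vector products (the operator and all sizes are invented for the example; this illustrates the general idea, not the group's implementation): only the products with a few random vectors are stored and used in place of the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5000, 20                       # matrix size and sketch size (made up)

# Tall factors defining an implicit matrix A = U @ V.T that we never store.
U = rng.normal(size=(n, 15))
V = rng.normal(size=(n, 15))

def matvec(v):
    """Product A @ v for the implicitly defined matrix A.

    In practice A might come from a simulation or a data stream and never be
    stored explicitly; here it is a synthetic low-rank operator for the demo.
    """
    return U @ (V.T @ v)

# Store only the products of A with k random vectors (the "sketch") ...
Omega = rng.normal(size=(n, k))
Y = np.column_stack([matvec(Omega[:, j]) for j in range(k)])

# ... and work with an orthonormal basis of that sketch instead of A itself.
Q, _ = np.linalg.qr(Y)
print(Q.shape)   # (n, k): a compact approximation of A's range
```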
And how does your method work?
Dietrich: I’ll start with a simplified example and then explain the general concept. Suppose a company wants to know whether a product will sell better or worse next month compared to the current one – based on its price and the number of units sold last month. This is a classification problem with two classes: “sells better” or A, and “sells worse” or B, and two input variables: price and quantity. A neural network is to be trained to represent this function. It receives price and quantity as inputs and returns either A or B as output. The training dataset consists of 100 to 1,000 input samples along with the corresponding real sales results. Now, how do we find the network’s parameters using our method? First, we select pairs of inputs from the dataset where one belongs to class A and the other to class B – so each pair has different output classes. Then we evaluate these pairs: the more similar the inputs are, the higher we set the probability that we will use that pair to calculate a parameter in the network. The reason is that prediction is generally harder when inputs are similar but outputs differ – as is the case here. After evaluating many such input pairs, we randomly select some of them, based on the computed probabilities, to be converted into parameters for neurons. Once all the neurons are constructed, we solve a simple linear system of equations to compute the linear combination of all neurons that produces the final output – that is, class A or B.
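The following is a rough, hypothetical sketch of the pair-based construction described above, using the price/quantity example with classes encoded as +1 (A, sells better) and -1 (B, sells worse). The synthetic data, the similarity-to-probability rule, and the way a pair is turned into a neuron are illustrative assumptions, not the group's exact algorithm; the last step replaces iterative training with a single least-squares solve for the output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: columns are (price, quantity); labels are +1 or -1.
X = rng.uniform(0, 1, size=(500, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.75, 1.0, -1.0)

n_neurons = 50
A_idx, B_idx = np.where(y > 0)[0], np.where(y < 0)[0]

# 1) Sample cross-class pairs, preferring pairs whose inputs are close together.
pairs = np.stack([rng.choice(A_idx, 2000), rng.choice(B_idx, 2000)], axis=1)
dists = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
prob = 1.0 / (dists + 1e-8)
prob /= prob.sum()
chosen = pairs[rng.choice(len(pairs), size=n_neurons, replace=False, p=prob)]

# 2) Turn each chosen pair into one hidden neuron: the weight points from one
#    input to the other, the bias centres the activation between them.
diff = X[chosen[:, 1]] - X[chosen[:, 0]]
W = diff / (np.linalg.norm(diff, axis=1, keepdims=True) ** 2)
b = -np.sum(W * (X[chosen[:, 0]] + X[chosen[:, 1]]) / 2, axis=1)

# 3) One pass through the data for the hidden features, then a linear solve
#    for the output layer instead of iterative training.
H = np.column_stack([np.tanh(X @ W.T + b), np.ones(len(X))])
coef, *_ = np.linalg.lstsq(H, y, rcond=None)

accuracy = np.mean(np.sign(H @ coef) == y)
print(f"training accuracy: {accuracy:.2f}")
```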
Okay, and how can this be generalised?
Dietrich: We replace iterative training steps with probabilistic calculations and specifically look for values in the datasets that change strongly and quickly as the parameters change. The main idea is to evaluate the nonlinear parameters based on the content of the data and the problem to be solved, and then to describe and solve the linear part – the final layer of the network – as a linear system of equations. In commonly used iterative training, the network has to be applied to the training dataset hundreds or even thousands of times, each time adjusting its parameters. With our method, we calculate the nonlinear parameters in one pass through the dataset and then solve a linear problem. So instead of 1,000 steps, we only need two – drastically reducing power consumption. This approach works for simple network architectures like feedforward or recurrent networks, which are primarily used in machine learning for tabular and time-series data, and now also for graph models, which process linked, graph-structured data. While our method requires slightly more neurons, its accuracy is comparable to that of iterative training.
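To make the contrast in the number of data passes concrete, here is a schematic comparison on a small synthetic regression problem (all numbers and sizes are arbitrary): a plain full-batch gradient-descent loop that revisits the dataset in every one of its many update steps, versus a one-pass feature construction followed by a single linear solve. Plain random features stand in here for the data-driven sampling described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression data, purely for illustration.
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

def features(X, W, b):
    """Hidden-layer activations: multiply, sum, apply a nonlinearity."""
    return np.tanh(X @ W.T + b)

# --- Iterative training: every update step is another pass over the data. ---
W = rng.normal(size=(32, 2))
b = rng.normal(size=32)
c = np.zeros(32)
d = 0.0
lr, passes = 0.05, 1000
for _ in range(passes):
    H = features(X, W, b)                    # full pass over the dataset
    err = H @ c + d - y                      # prediction error
    grad_c = H.T @ err / len(X)              # gradients of the squared error
    grad_d = err.mean()
    G = (err[:, None] * c) * (1.0 - H ** 2)  # backpropagate through tanh
    grad_W = G.T @ X / len(X)
    grad_b = G.mean(axis=0)
    c -= lr * grad_c
    d -= lr * grad_d
    W -= lr * grad_W
    b -= lr * grad_b
print("iterative training: data passes =", passes)

# --- One-pass alternative: build features once, then solve a linear system. ---
W2 = rng.normal(size=(32, 2))
b2 = rng.normal(size=32)
H2 = np.column_stack([features(X, W2, b2), np.ones(len(X))])
coef, *_ = np.linalg.lstsq(H2, y, rcond=None)
print("one-pass approach: data passes = 1, plus one linear solve")
```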
Who benefits from your research?
Dietrich: At the moment, researchers or companies – for example, in the financial industry – that use classical machine learning and train small neural networks on tabular data, or that use AI models to summarise data and vary them for predictions. Researchers who train networks with simulation data to create surrogate models can also reach their goal more quickly. To support the development of new AI systems or even generative AI with mathematical methods, we’re currently looking for mathematical solutions for convolutional layers, which specialise in processing images, and for attention layers, which are mainly used in large, generative language models. Once we succeed there, we can develop solutions for transformer networks, which currently form the foundation of text-based generative AI.
Where are the limits of parameter calculation?
Dietrich: Since we select the parameters to be calculated randomly, we need more neurons. That means the networks we train are somewhat larger than those obtained through iterative training. As a result, networks trained with our method require more energy during operation, but only because we haven’t yet explored existing methods for compressing trained networks. Another limitation is that we haven’t yet developed our method for all types of network architectures used in current AI systems – unfortunately, especially not for those used in energy-intensive, complex models like generative AI. However, I don’t believe there’s a conceptual barrier, and we already have some initial ideas and promising tests to extend the method.
And what advice do you have for researchers working with AI models and generative AI — how can they analyse their data more efficiently?
Dietrich: There are now many pre-trained AI models available for a wide range of applications. So my recommendation is: don’t reinvent the wheel – meaning, don’t constantly retrain models from scratch – but instead build on existing models. Even for new, custom datasets, these models can be fine-tuned and adapted. In many cases, building and training a model is far too time-consuming, especially since researchers and companies are focused on their domain-specific questions, not on training AI models. (vs | LRZ)
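As a generic illustration of that advice (a hedged sketch using PyTorch and torchvision as an example toolchain, which is an assumption; the interview does not name specific tools): load a pretrained model, freeze its existing layers, and train only a small new head on your own data.

```python
import torch
from torch import nn
from torchvision import models

# Start from a pretrained model instead of training from scratch
# (weights identifier valid for torchvision >= 0.13; downloads weights once).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the existing feature extractor ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace only the final layer for the new task (5 classes, made up).
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the small new head is optimized, which is far cheaper than full training.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One fine-tuning step on a dummy batch; replace with batches from your own dataset.
x = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = loss_fn(model(x), labels)
loss.backward()
optimizer.step()
```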