LLM cost reduction: a loss landscape point of view

https://ia.loria.fr/talk

Christophe Cerisara, CNRS researcher: cerisara@loria.fr

The LLMs we produced in 2025:

LLM Loss landscape

  • = plot of the empirical risk (the sum of the errors over the training dataset) as a function of the model’s parameters

In high dimension: all local minima are global; linear mode connectivity between minima
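
To make the connectivity claim concrete, here is a minimal, hypothetical sketch (not from the talk) that evaluates the empirical risk along the straight line between two trained parameter vectors; `model_a`, `model_b`, `loss_fn` and `loader` are placeholder names.

```python
import copy
import torch

def loss_on_segment(model_a, model_b, loss_fn, loader, steps=11):
    """Evaluate the empirical risk at theta = (1-t)*theta_a + t*theta_b.

    A flat, linearly connected valley shows no loss barrier along the segment.
    """
    probe = copy.deepcopy(model_a)
    probe.eval()
    losses = []
    with torch.no_grad():
        for i in range(steps):
            t = i / (steps - 1)
            # Interpolate every parameter tensor between the two solutions.
            for p, pa, pb in zip(probe.parameters(),
                                 model_a.parameters(),
                                 model_b.parameters()):
                p.copy_((1 - t) * pa + t * pb)
            total, n = 0.0, 0
            for x, y in loader:
                total += loss_fn(probe(x), y).item() * len(x)
                n += len(x)
            losses.append(total / n)
    return losses
```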

LLM pruning / low-rank compression

  • Iterative Magnitude Pruning theoretically leads to high compression rates
  • Iterative Magnitude Pruning requires retraining
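
As a hedged illustration of the generic recipe (not the specific method presented later), a minimal iterative-magnitude-pruning loop with `torch.nn.utils.prune`; the `retrain` callback is a hypothetical fine-tuning routine.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model, retrain, amount=0.2, rounds=3):
    """Alternate magnitude pruning and retraining (the reason IMP is costly).

    Each round removes `amount` of the remaining weights of every Linear layer
    (smallest magnitudes first), then calls `retrain` to recover accuracy.
    """
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)
        retrain(model)  # hypothetical recovery fine-tuning
    return model
```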

Lottery Ticket Hypothesis

  • Lottery Ticket Hypothesis:
    • Each neural network contains a sub-network (winning ticket) that, if trained again in isolation, matches the performance of the full model.

  • Retraining after pruning may not be required in an ultra-flat valley
  • But vanilla SGD converges towards the edge of valleys: the “edge of stability”
  • This may be mitigated with Stochastic Weight Averaging (SWA), sketched below
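
A minimal SWA sketch with `torch.optim.swa_utils` (a generic recipe; `model`, `optimizer`, `loader` and `loss_fn` are placeholders).

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loader, loss_fn, epochs=10, swa_start=5):
    """Average weights over the last epochs to move away from the valley edge."""
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # running average of the weights
            swa_scheduler.step()
    update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged model
    return swa_model
```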

Our contribution: NAACL’25

  • Low-rank pruning method
  • Reduce costs of retraining:
    • ShearedLlama needs 52B tokens
    • Minitron needs 94B tokens
    • Llillama needs 0.013B tokens
  • Results:
    • Compress Mixtral-48B, Gemma-27B on 1xA100
    • Good results with Phi3-14B, Phi2-3B, Mistral-7B
    • The compressed Mixtral-48B runs on 1xA100 with a 2048-token context & batch size 4
    • Compress Mamba-3B, FalconMamba-7B, Whisper-med
  • Is there still any free space in LLM matrices? (parameter-efficiency)
  • If not, we may not need all of this information at test time
  • Pruning: remove “unused” or “superfluous” dimensions
  • Metric to measure “emptiness”: matrix rank

LLM matrices are nearly full rank

But activations are low rank
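
A tiny, hypothetical sketch contrasting these two statements via the stable rank; the tensors are random stand-ins, not real LLM weights or activations.

```python
import torch

def stable_rank(matrix: torch.Tensor) -> torch.Tensor:
    """Stable rank ||A||_F^2 / sigma_max(A)^2: a smooth proxy for matrix rank."""
    s = torch.linalg.svdvals(matrix)
    return (s ** 2).sum() / s[0] ** 2

W = torch.randn(1024, 1024)                       # stand-in for an LLM weight matrix
H = torch.randn(512, 64) @ torch.randn(64, 1024)  # stand-in for low-rank activations
print(stable_rank(W))   # large: weight matrices are (nearly) full rank
print(stable_rank(H))   # small: activations concentrate in a low-dimensional subspace
```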

  • Principle: find a low-rank matrix that minimizes reconstruction error: \[\widehat{\Delta W} = \underset{{\Delta W}}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - {\Delta Wx}\|_{F}\]
  • Solution (for linear layers only; a code sketch follows this list): \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T, \qquad \Sigma = USU^T, \qquad A=U, \;\; B=U^TW\]
  • LORD (Kaushal et al., 2023)
  • Our contributions:
    • Generalize to non-linear layers
      • Linear algebra \(\rightarrow\) Feature Distillation
    • Tunable compromise between local and global optimization
      • Local \(\rightarrow\) Flexible semi-global
    • Improved distillation
      • Teacher-only \(\rightarrow\) Teacher & Student supervision
    • Low-cost algo: bottom-first compression
  • Contribution: a better compromise between teacher and student inputs

  • Evidence: deeper layers are more robust to compression:

  • Bottom-first compression:
    • Low memory requirements:
      • Compress layers one by one
      • No backprop
    • Low computational cost & sample-efficient:
      • Partial forward pass
      • SVD initialization reduces data requirements
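
Here is the code sketch referenced above: a minimal, hypothetical implementation of the linear-layer solution (eigendecomposition of the output covariance), not the exact NAACL’25 pipeline; the shapes and the calibration batch `X` are illustrative.

```python
import torch

def low_rank_factorize(W: torch.Tensor, X: torch.Tensor, k: int):
    """Factor W (d_out x d_in) as A @ B with rank k, minimizing the output
    reconstruction error over a calibration batch X (n x d_in), following
    Sigma = E[y y^T] - E[y] E[y]^T = U S U^T,  A = U_k,  B = U_k^T W."""
    Y = X @ W.T                                   # outputs y = W x
    Yc = Y - Y.mean(dim=0, keepdim=True)
    Sigma = Yc.T @ Yc / len(Y)                    # output covariance (d_out x d_out)
    eigvals, U = torch.linalg.eigh(Sigma)         # eigenvalues in ascending order
    U_k = U[:, -k:]                               # top-k eigendirections
    return U_k, U_k.T @ W                         # A (d_out x k), B (k x d_in)

# Replacing one Linear layer by two smaller ones pays off when
# k * (d_in + d_out) < d_in * d_out.
W = torch.randn(1024, 4096)
X = torch.randn(256, 4096) @ torch.randn(4096, 64) @ torch.randn(64, 4096)  # low-rank activations
A, B = low_rank_factorize(W, X, k=128)
err = ((X @ W.T) - (X @ B.T @ A.T)).norm() / (X @ W.T).norm()
print(f"relative output reconstruction error: {err:.3f}")  # small, because X is low rank
```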

Growing nets

Adding dimensions one by one during training (a minimal growth step is sketched after the list below)

  • Progressive nets, dynamic architectures, NAS
  • Theorem: Growing during training leads to flatter minima
  • Intuition of the proof:
    • Consider the full parameter space \(\Omega\) frozen, except a hyperplane \(H\) (the initial net)
    • Derive the volume \(V\) of the basins of attraction around minima \(\theta^*\) in \(\Omega\): flatter \(\theta^* \implies\) larger \(V\)
    • Derive the probability that \(H\) intersects \(V\): larger \(V \implies\) higher probability
    • Lemma: SGD is more likely to converge towards a large \(V\)
    • Prove that a larger \(V \implies\) a smaller Hessian (flatter minimum)
  • Experimental results on CIFAR-100 with ResNet:
    • \(\lambda_{\text{max}}\) decreases from 800 to 775
    • Test accuracy is stable: from 60.6% to 60.9%
  • Contributes to the debate about flatness vs. generalization:
    • More and more evidence points towards decoupling flatness and generalization
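
As referenced above, a minimal hypothetical growth step for a two-layer MLP (the growth schedule, training loop and optimizer re-creation are omitted).

```python
import torch
import torch.nn as nn

def grow_hidden_layer(fc1: nn.Linear, fc2: nn.Linear):
    """Add one hidden unit between fc1 and fc2 without changing the function.

    The new unit's outgoing weight starts at zero, so training resumes from the
    same point but in a larger space (the frozen space Omega vs. hyperplane H
    of the proof sketch)."""
    new_fc1 = nn.Linear(fc1.in_features, fc1.out_features + 1)
    new_fc2 = nn.Linear(fc2.in_features + 1, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight[:-1].copy_(fc1.weight)   # keep old incoming weights
        new_fc1.bias[:-1].copy_(fc1.bias)       # (the new unit keeps its random init)
        new_fc2.weight[:, :-1].copy_(fc2.weight)
        new_fc2.weight[:, -1].zero_()           # zero outgoing weight: same output
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Typical use: every few epochs, grow the layer and rebuild the optimizer.
fc1, fc2 = nn.Linear(32, 8), nn.Linear(8, 10)
fc1, fc2 = grow_hidden_layer(fc1, fc2)
```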

Parameter Efficient Finetuning

  • Additive PEFT methods increase the model’s dimensionality, but LoRA, sparseFT, … do not
  • Most PEFT methods reduce costs:
    • they do not store the LLM’s gradients
    • they do not store the optimizer’s momentum
    • but they still require backprop through the LLM (unlike ladder side-tuning, sketched below)!
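
By contrast, a minimal ladder-side-tuning-style sketch (a toy under assumptions, not the exact setup studied below): hidden states of a frozen backbone are detached, so no gradient ever flows back through it.

```python
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    """Small trainable 'ladder' reading detached hidden states of a frozen backbone."""
    def __init__(self, hidden_dim, side_dim, num_taps, num_classes):
        super().__init__()
        self.downs = nn.ModuleList(nn.Linear(hidden_dim, side_dim) for _ in range(num_taps))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, hidden_states):
        h = 0
        for down, state in zip(self.downs, hidden_states):
            h = h + down(state.detach())     # detached: no gradient into the backbone
        return self.head(h.mean(dim=1))      # mean-pool over sequence positions

# Toy frozen "backbone" standing in for an LLM whose layers we tap.
backbone = nn.ModuleList(nn.Linear(64, 64) for _ in range(4)).requires_grad_(False)
side = SideNetwork(hidden_dim=64, side_dim=16, num_taps=4, num_classes=2)
optimizer = torch.optim.AdamW(side.parameters(), lr=1e-3)

x = torch.randn(8, 10, 64)                   # (batch, sequence, hidden)
hidden_states, h = [], x
with torch.no_grad():                        # the backbone is only run forward
    for layer in backbone:
        h = torch.relu(layer(h))
        hidden_states.append(h)
logits = side(hidden_states)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()                              # gradients exist only in the side network
optimizer.step()
```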

  • Our contribution: scaling laws for Ladder Side Tuning (LST)
  • test loss scaling law:

\[ \mathcal{L}(C) = \frac{546}{C^{0.26}} + 0.21\]
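
As a readability aid, a tiny transcription of this fitted law (the constants are taken from the formula above; the unit of \(C\) is the one used in the original fit and is not restated here).

```python
# Fitted test-loss scaling law from above: L(C) = 546 / C**0.26 + 0.21,
# where 0.21 acts as the irreducible loss as compute C grows.
def predicted_test_loss(C: float) -> float:
    return 546.0 / C ** 0.26 + 0.21

for C in (1e6, 1e9, 1e12):  # C in the (unspecified here) compute units of the fit
    print(f"C = {C:.0e} -> predicted test loss = {predicted_test_loss(C):.2f}")
```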

  • downstream accuracy scaling law:

  • memory scaling law:

Unsupervised risk training

  • So far, we have optimized the empirical loss landscape
  • What we really want is the population risk \(R(\theta) = \int_{X,Y} L(f_{\theta}(x),y)\,dx\,dy\)
  • Assume unlabeled training data and 2 classes
  • Assume the class-conditional distribution of the classification score is Gaussian:

\[f_{\theta}(x)\mid y \;\sim\; \mathcal{N}(\mu_y, \sigma_y)\]

  • 10k epochs of training on the WDBC dataset (from PMLB)

  • We derive:

\[\begin{aligned} R(\theta) \simeq{} & \frac{p(y=0)}{2}\,(1+\mu_0)\left(1-\operatorname{erf}\left(\frac{-1-\mu_0}{\sigma_0\sqrt 2}\right)\right) + p(y=0)\,\sigma_0^2\,\mathcal{N}(-1;\mu_0,\sigma_0) \\ & + \frac{p(y=1)}{2}\,(1-\mu_1)\left(1+\operatorname{erf}\left(\frac{1-\mu_1}{\sigma_1\sqrt 2}\right)\right) + p(y=1)\,\sigma_1^2\,\mathcal{N}(1;\mu_1,\sigma_1) \end{aligned}\]

  • This closed-form risk estimate (transcribed into code below) can be used to:
    • train without supervised labels, when the class priors are known
    • post-train a DNN to improve generalization
    • regularize the empirical risk
    • select models without a validation corpus, replacing early stopping
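
A direct, hypothetical transcription of the closed-form risk above; the example values of \(\mu_y, \sigma_y\) and the priors are made up, and in practice \(\mu_y, \sigma_y\) would be estimated from the unlabeled scores (e.g. with a two-component Gaussian mixture).

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def closed_form_risk(mu0, sigma0, mu1, sigma1, p0, p1):
    """Population-risk estimate from the class-conditional Gaussians of the
    classification score, transcribing the formula above (reference points -1 and +1)."""
    r0 = (p0 / 2) * (1 + mu0) * (1 - math.erf((-1 - mu0) / (sigma0 * math.sqrt(2)))) \
        + p0 * sigma0 ** 2 * gaussian_pdf(-1, mu0, sigma0)
    r1 = (p1 / 2) * (1 - mu1) * (1 + math.erf((1 - mu1) / (sigma1 * math.sqrt(2)))) \
        + p1 * sigma1 ** 2 * gaussian_pdf(1, mu1, sigma1)
    return r0 + r1

# Made-up Gaussian parameters for the two score distributions, equal priors.
print(closed_form_risk(mu0=-0.8, sigma0=0.3, mu1=0.9, sigma1=0.2, p0=0.5, p1=0.5))
```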

MetaEval Offensive talk detection:

Thank you!

cerisara@loria.fr

https://ia.loria.fr/