LLM cost reduction: a loss landscape point of view
https://ia.loria.fr/talk
Christophe Cerisara, CNRS researcher: cerisara@loria.fr
Our LLM production in 2025:
- 1 LLM pretrained on 5T tokens: Lucie-7B
- 1 LLM post-trained on 3B tokens: Mille-Pensées
- 5 finetuned LLMs
LLM Loss landscape
- = plot of the empirical risk (sum of the errors over the training dataset) as a function of the model's parameters
In high dimension: all local minima are global; linear mode connectivity
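To make the picture concrete, here is a minimal sketch (not from the talk) that probes the empirical risk along a random direction around the current parameters, the 1-D analogue of such loss-landscape plots; `model`, `loss_fn`, `data` and `targets` are placeholder names.

```python
# Minimal sketch (not from the talk): probe the empirical risk along one random direction
# around the current parameters, the 1-D analogue of a loss-landscape plot.
import torch

def loss_along_direction(model, loss_fn, data, targets, alphas):
    """Evaluate the loss at theta* + alpha * d for a random, per-parameter-scaled direction d."""
    theta = [p.detach().clone() for p in model.parameters()]
    d = [torch.randn_like(t) for t in theta]
    d = [di * t.norm() / (di.norm() + 1e-12) for di, t in zip(d, theta)]  # scale to parameter norm
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, t, di in zip(model.parameters(), theta, d):
                p.copy_(t + a * di)
            losses.append(loss_fn(model(data), targets).item())
        for p, t in zip(model.parameters(), theta):  # restore the original parameters
            p.copy_(t)
    return losses

# Toy usage with placeholder model and data:
net = torch.nn.Linear(5, 3)
x, y = torch.randn(64, 5), torch.randint(0, 3, (64,))
curve = loss_along_direction(net, torch.nn.CrossEntropyLoss(), x, y, torch.linspace(-1, 1, 21))
```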
LLM pruning / low-rank compression
- Iterative Magnitude Pruning (IMP) theoretically leads to high compression rates
- But IMP requires retraining
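As an illustration, a minimal sketch of one IMP round in PyTorch; the `model` and the `train` retraining loop are hypothetical placeholders.

```python
# Minimal sketch of one Iterative Magnitude Pruning round for a PyTorch model
# (`model` and the `train` retraining loop are hypothetical placeholders).
import torch

@torch.no_grad()
def magnitude_prune(model, sparsity=0.2):
    """Zero out the `sparsity` fraction of smallest-magnitude weights (global threshold)."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([p.abs().flatten() for p in weights]).sort().values
    threshold = all_mags[int(sparsity * len(all_mags))]
    masks = [(p.abs() > threshold).float() for p in weights]
    for p, m in zip(weights, masks):
        p.mul_(m)
    return masks  # reapply the masks after every optimizer step to keep pruned weights at zero

# Iterative loop: prune, then retrain the surviving weights
# (the Lottery Ticket variant, discussed next, additionally rewinds survivors to their initial values):
# for _ in range(n_rounds):
#     masks = magnitude_prune(model, sparsity=0.2)
#     train(model, masks)   # hypothetical retraining step
```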
Lottery Ticket Hypothesis
- Lottery Ticket Hypothesis: each neural network contains a sub-network (a winning ticket) that, when trained again in isolation, matches the performance of the full model.
- Retraining after pruning may not be required in an ultra-flat valley
- But vanilla SGD converges towards the edge of valleys: "edge of stability"
- This may be mitigated with Stochastic Weight Averaging (SWA)
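For reference, a short sketch of SWA with PyTorch's built-in `torch.optim.swa_utils`, on a toy model; the schedule and hyperparameters are illustrative only.

```python
# Minimal sketch of Stochastic Weight Averaging with PyTorch's swa_utils (toy setup).
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins so the sketch runs end-to-end
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))), batch_size=32)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

swa_model = AveragedModel(model)          # keeps a running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
swa_start = 15                            # start averaging late in training

for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)  # averaging pulls the weights towards the valley center
        swa_scheduler.step()

update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged weights (no-op here)
```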
Our contribution: NAACL’25
- Low-rank pruning method
- Reduces the cost of retraining:
- ShearedLlama needs 52B tokens
- Minitron needs 94B tokens
- Llillama needs 0.013B tokens
- Results:
- Compress Mixtral-48B, Gemma-27B on 1xA100
- Good results with Phi3-14B, Phi2-3B, Mistral-7B
- Mixtral-48B can run on 1xA100 with a 2048-token context & batch size 4
- Compress Mamba-3B, FalconMamba-7B, Whisper-med
- Is there still any free space in LLM matrices? (parameter efficiency)
- If not, we may not need all of this information at test time
- Pruning: remove "unused" or "superfluous" dimensions
- Metric to measure "emptiness": matrix rank
LLM matrices are nearly full rank
But activations are low rank
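A minimal sketch (with synthetic stand-in tensors) of the contrast between the two spectra: the weight matrix is nearly full rank, while its output activations are not.

```python
# Minimal sketch (synthetic stand-in tensors): compare the singular-value spectrum of a
# weight matrix with that of its output activations; the low-rank structure is in the
# activations, not in the weights.
import torch

def effective_rank(s, tol=1e-2):
    """Number of singular values above tol * the largest one."""
    return int((s > tol * s[0]).sum())

d, r = 1024, 32
W = torch.randn(d, d) / d**0.5                  # stand-in for an LLM weight matrix
X = torch.randn(512, r) @ torch.randn(r, d)     # synthetic low-rank input activations
Y = X @ W.T                                     # output activations y = W x

print("effective rank of W:", effective_rank(torch.linalg.svdvals(W)))   # close to d: nearly full rank
print("effective rank of Y:", effective_rank(torch.linalg.svdvals(Y)))   # close to r: low rank
```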
- Principle: find a low-rank matrix \(\Delta W\) that minimizes the reconstruction error: \[\widehat{\Delta W} = \underset{\Delta W}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - \Delta W x\|_{F}\]
- Solution (exact only for linear layers), with \(y = Wx\) and \(U\) truncated to the target rank: \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T ~~~ \Sigma = USU^T ~~~ A=U, ~ B=U^TW\]
- LORD (Kaushal, 2023); see the sketch below
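A minimal sketch of this eigendecomposition-based solution, assuming a weight matrix `W` and a batch of calibration inputs `X` (placeholder names); the top-r eigenvectors of the output covariance give the factors A and B.

```python
# Minimal sketch of the eigendecomposition solution above (assumes a weight matrix `W`
# of shape (d_out, d_in) and a batch of calibration inputs `X` of shape (N, d_in)).
import torch

def low_rank_factors(W, X, r):
    """Return A (d_out x r) and B (r x d_in) such that A @ B approximates W on the data."""
    Y = X @ W.T                           # output activations y = W x
    Sigma = torch.cov(Y.T)                # Sigma = E[y y^T] - E[y] E[y]^T
    _, U = torch.linalg.eigh(Sigma)       # Sigma = U S U^T, eigenvalues in ascending order
    U_r = U[:, -r:]                       # keep the top-r eigenvectors
    return U_r, U_r.T @ W                 # A = U, B = U^T W

# Usage: replace y = W x by y = A (B x), i.e. r*(d_out + d_in) parameters instead of d_out*d_in.
W = torch.randn(256, 512)
X = torch.randn(1024, 512)
A, B = low_rank_factors(W, X, r=64)
```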
- Our contributions:
- Generalize to non-linear layers: linear algebra \(\rightarrow\) feature distillation
- Tunable compromise between local and global optimization: local \(\rightarrow\) flexible semi-global
- Improved distillation: teacher-only \(\rightarrow\) teacher & student supervision
- Low-cost algorithm: bottom-first compression
- Contribution: better compromise between teacher and student inputs
- Evidence: deeper layers are more robust to compression
- Bottom-first compression:
- Low memory requirements:
- Compress layers 1 by 1
- No backprop
- Low computational cost & sample-efficient:
- Partial forward pass
- SVD initialization: reduces data requirements
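A rough sketch of such a bottom-first pass, assuming `blocks` is a list of linear layers and `X` a calibration batch (placeholder names); each layer is compressed from its own input activations, with no backpropagation.

```python
# Rough sketch of bottom-first compression: compress layers one by one, bottom-up,
# each from its own input activations, with no backpropagation.
# Assumes `blocks` is a list of torch.nn.Linear layers and `X` a calibration batch.
import torch

@torch.no_grad()
def bottom_first_compress(blocks, X, r):
    h = X
    for block in blocks:
        Y = h @ block.weight.T                      # this layer's outputs only (partial forward)
        _, U = torch.linalg.eigh(torch.cov(Y.T))    # SVD-like init from the output covariance
        U_r = U[:, -r:]                             # top-r eigenvectors
        block.weight.copy_(U_r @ (U_r.T @ block.weight))   # low-rank reconstruction A @ B
        # (a real implementation would split the layer into two smaller ones, x -> A (B x),
        #  and locally distill the teacher layer's features into the compressed student layer)
        h = block(h)                                # feed the compressed output to the next layer
    return blocks
```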
Growing nets
Adding dimensions one by one during training (see the sketch below)
- Progressive nets, dynamic architectures, NAS
- Theorem: Growing during training leads to flatter minima
- Intuition of the proof:
- Consider the full space \(\Omega\) frozen, except a hyperplane \(H\) (the initial net)
- Derive the volume \(V\) of the basins of attraction around minima \(\theta^*\) in \(\Omega\): flatter \(\theta^* \implies\) larger \(V\)
- Derive the probability that \(H\) intersects \(V\): larger \(V \implies\) higher probability
- Lemma: SGD is more likely to converge towards large \(V\)
- Prove that larger \(V \implies\) smaller Hessian (flatter minimum)
- Experimental results on CIFAR-100 with a ResNet:
- \(\lambda_{\text{max}}\) decreases from 800 to 775
- Test accuracy is stable: from 60.6% to 60.9%
- This contributes to the debate about flatness vs. generalization
- More and more evidence points towards decoupling flatness from generalization
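A toy sketch of the growing idea (not the paper's algorithm): a hidden layer gains one unit at a time, preserving the already-trained weights so training can simply continue in the enlarged space.

```python
# Toy sketch of growing a network during training (not the paper's algorithm):
# widen a hidden Linear layer by one unit, keeping the already-trained weights.
import torch
import torch.nn as nn

def grow_hidden_by_one(fc1: nn.Linear, fc2: nn.Linear):
    """Return new (fc1, fc2) with one extra hidden unit; trained weights are preserved."""
    new_fc1 = nn.Linear(fc1.in_features, fc1.out_features + 1)
    new_fc2 = nn.Linear(fc2.in_features + 1, fc2.out_features)
    with torch.no_grad():
        # new hidden unit: small random incoming weights, zero outgoing weights,
        # so the function computed by the network is (almost) unchanged at growth time
        new_fc1.weight.copy_(torch.cat([fc1.weight, 1e-3 * torch.randn(1, fc1.in_features)], dim=0))
        new_fc1.bias.copy_(torch.cat([fc1.bias, torch.zeros(1)]))
        new_fc2.weight.copy_(torch.cat([fc2.weight, torch.zeros(fc2.out_features, 1)], dim=1))
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Usage: every few epochs, grow and rebuild the optimizer, enlarging the search space
# (the hyperplane H above) one dimension at a time:
# fc1, fc2 = grow_hidden_by_one(fc1, fc2)
```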
Parameter Efficient Finetuning
- Additive PEFT methods increase the model's dimensionality; LoRA, sparse FT, … do not
- Most PEFT methods reduce costs:
- no need to store the LLM gradients
- no need to store the optimizer momentum
- but they still require backprop through the LLM!
- Our contribution: scaling laws for LST (Ladder Side-Tuning)
- test loss scaling law:
\[\mathcal{L}(C) = \frac{546}{C^{0.26}} + 0.21\]
- downstream accuracy scaling law
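As an illustration of how such a law can be obtained, a minimal sketch that fits a saturating power law \(\mathcal{L}(C) = a/C^{b} + c\) to (compute, test loss) measurements; the data points below are synthetic placeholders generated from the reported coefficients.

```python
# Minimal sketch of fitting a saturating power law L(C) = a / C^b + c to
# (compute, test loss) measurements; the data points are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, c):
    return a / C**b + c

rng = np.random.default_rng(0)
C = np.array([1e12, 1e14, 1e15, 1e16, 1e18])                      # placeholder compute budgets
L = scaling_law(C, 546.0, 0.26, 0.21) + rng.normal(0, 0.002, C.size)  # synthetic losses from the reported law

(a, b, c), _ = curve_fit(scaling_law, C, L, p0=(500.0, 0.25, 0.2), maxfev=10000)
print(f"L(C) = {a:.0f} / C^{b:.2f} + {c:.2f}")
print("predicted loss at C = 1e20:", scaling_law(1e20, a, b, c))
```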
Unsupervised risk training
- So far, we have optimized the empirical loss landscape
- What we really want is the population risk \(R(\theta) = \int_{X,Y} L(f_{\theta}(x),y)\,p(x,y)\,dx\,dy\)
- Assume unlabeled training data and 2 classes
- Assume the conditional distribution of the classification score is Gaussian:
\[p(f_{\theta}(x)|y) \sim N(\mu_y, \sigma_y)\]
- 10k epochs of training on the WDBC dataset (from PMLB)
\[\begin{aligned}
R(\theta) \simeq\;& \frac{p(y=0)}{2} (1+\mu_0) \left( 1-\mathrm{erf}\left( \frac{-1-\mu_0}{\sigma_0\sqrt 2} \right)\right) + p(y=0)\,\sigma_0^2\, N(-1;\mu_0,\sigma_0) \\
+\;& \frac{p(y=1)}{2} (1-\mu_1) \left( 1+\mathrm{erf}\left( \frac{1-\mu_1}{\sigma_1\sqrt 2} \right)\right) + p(y=1)\,\sigma_1^2\, N(1;\mu_1,\sigma_1)
\end{aligned}\]
- This closed-form risk estimation can be used to:
- train without supervised labels, when the class priors are known
- post-train a DNN to improve generalization
- regularize the empirical risk
- select models without a validation corpus, replacing early stopping
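As a numerical sanity check (not from the talk), the closed-form expression above coincides with the expectation of a hinge loss \(\max(0, 1 - t f_{\theta}(x))\) with targets \(t = \pm 1\) under the Gaussian score assumption; the sketch below compares it with a Monte-Carlo estimate, using illustrative parameter values.

```python
# Numerical check of the closed-form risk above, assuming it corresponds to an expected
# hinge loss max(0, 1 - t * f(x)) with targets t = -1 (class 0) and t = +1 (class 1)
# and Gaussian class-conditional scores. All parameter values are illustrative.
import math
import numpy as np

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def closed_form_risk(p0, mu0, s0, mu1, s1):
    r0 = 0.5 * (1 + mu0) * (1 - math.erf((-1 - mu0) / (s0 * math.sqrt(2)))) + s0**2 * gauss_pdf(-1, mu0, s0)
    r1 = 0.5 * (1 - mu1) * (1 + math.erf((1 - mu1) / (s1 * math.sqrt(2)))) + s1**2 * gauss_pdf(1, mu1, s1)
    return p0 * r0 + (1 - p0) * r1

def monte_carlo_risk(p0, mu0, s0, mu1, s1, n=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    s_neg = rng.normal(mu0, s0, int(n * p0))        # scores of class 0 (target -1)
    s_pos = rng.normal(mu1, s1, n - int(n * p0))    # scores of class 1 (target +1)
    return np.concatenate([np.maximum(0, 1 + s_neg), np.maximum(0, 1 - s_pos)]).mean()

p0, mu0, s0, mu1, s1 = 0.4, -0.8, 0.6, 0.9, 0.5     # illustrative Gaussian score parameters
print("closed form :", closed_form_risk(p0, mu0, s0, mu1, s1))
print("monte carlo :", monte_carlo_risk(p0, mu0, s0, mu1, s1))
```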
MetaEval offensive talk detection
Thank you!
cerisara@loria.fr
https://ia.loria.fr/