LLM cost reduction: a loss landscape point of view

https://ia.loria.fr/talk

Christophe Cerisara, CNRS researcher: cerisara@loria.fr

The LLMs we produced in 2025:

LLM Loss landscape

  • = plot of the empirical risk (the sum of the errors over the training dataset) as a function of the model’s parameters

In high dimension: all local minima are global; linear mode connectivity between minima
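
To make the connectivity claim concrete, here is a minimal, hypothetical sketch (not from the talk) that evaluates the empirical risk along the straight line between two trained parameter vectors; `model_a`, `model_b`, `loss_fn` and `loader` are placeholder names.

```python
import copy
import torch

def loss_on_segment(model_a, model_b, loss_fn, loader, steps=11):
    """Evaluate the empirical risk at theta = (1-t)*theta_a + t*theta_b.

    A flat, linearly connected valley shows no loss barrier along the segment.
    """
    probe = copy.deepcopy(model_a)
    probe.eval()
    losses = []
    with torch.no_grad():
        for i in range(steps):
            t = i / (steps - 1)
            # Interpolate every parameter tensor between the two solutions.
            for p, pa, pb in zip(probe.parameters(),
                                 model_a.parameters(),
                                 model_b.parameters()):
                p.copy_((1 - t) * pa + t * pb)
            total, n = 0.0, 0
            for x, y in loader:
                total += loss_fn(probe(x), y).item() * len(x)
                n += len(x)
            losses.append(total / n)
    return losses
```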

LLM pruning / low-rank compression

  • Iterative Magnitude Pruning theoretically leads to high compression rates
  • Iterative Magnitude Pruning requires retraining
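
As a hedged illustration of the generic recipe (not the specific method presented later), a minimal iterative-magnitude-pruning loop with `torch.nn.utils.prune`; the `retrain` callback is a hypothetical fine-tuning routine.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model, retrain, amount=0.2, rounds=3):
    """Alternate magnitude pruning and retraining (the reason IMP is costly).

    Each round removes `amount` of the remaining weights of every Linear layer
    (smallest magnitudes first), then calls `retrain` to recover accuracy.
    """
    for _ in range(rounds):
        for module in model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)
        retrain(model)  # hypothetical recovery fine-tuning
    return model
```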

Lottery Ticket Hypothesis

  • Lottery Ticket Hypothesis:
    • Each neural network contains a sub-network (winning ticket) that, if trained again in isolation, matches the performance of the full model.

  • Retraining after pruning may not be required in an ultra-flat valley
  • But vanilla SGD converges towards the edge of valleys: the “edge of stability”
  • This may be mitigated with Stochastic Weight Averaging (SWA), sketched below
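
A minimal SWA sketch with `torch.optim.swa_utils` (a generic recipe; `model`, `optimizer`, `loader` and `loss_fn` are placeholders).

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loader, loss_fn, epochs=10, swa_start=5):
    """Average weights over the last epochs to move away from the valley edge."""
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # running average of the weights
            swa_scheduler.step()
    update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged model
    return swa_model
```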

Our contribution: NAACL’25

  • Low-rank pruning method
  • Reduce costs of retraining:
    • ShearedLlama needs 52B tokens
    • Minitron needs 94B tokens
    • Llillama needs 0.013B tokens
  • Results:
    • Compress Mixtral-48B, Gemma-27B on 1xA100
    • Good results with Phi3-14B, Phi2-3B, Mistral-7B
    • The compressed Mixtral-48B runs on 1xA100 with a 2048-token context & batch size 4
    • Compress Mamba-3B, FalconMamba-7B, Whisper-med
  • Is there still any free space in LLM matrices? (parameter-efficiency)
  • If not, we may not need all of this information at test time
  • Pruning: remove “unused” or “superfluous” dimensions
  • Metric to measure “emptiness”: matrix rank

LLM matrices are nearly full rank

But activations are low rank
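
A tiny, hypothetical sketch contrasting these two statements via the stable rank; the tensors are random stand-ins, not real LLM weights or activations.

```python
import torch

def stable_rank(matrix: torch.Tensor) -> torch.Tensor:
    """Stable rank ||A||_F^2 / sigma_max(A)^2: a smooth proxy for matrix rank."""
    s = torch.linalg.svdvals(matrix)
    return (s ** 2).sum() / s[0] ** 2

W = torch.randn(1024, 1024)                       # stand-in for an LLM weight matrix
H = torch.randn(512, 64) @ torch.randn(64, 1024)  # stand-in for low-rank activations
print(stable_rank(W))   # large: weight matrices are (nearly) full rank
print(stable_rank(H))   # small: activations concentrate in a low-dimensional subspace
```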

  • Principle: find a low-rank matrix that minimizes reconstruction error: \[\widehat{\Delta W} = \underset{{\Delta W}}{\mathrm{argmin}} \;\; \frac{1}{N}\sum\limits_{x \in \mathcal{D}}\|Wx - {\Delta Wx}\|_{F}\]
  • Solution (for linear layers only; a code sketch follows this list): \[\Sigma = \underset{y \in \mathcal{Y}}{\mathbb{E}}\left[yy^T\right] - \mathbb{E}[y]\mathbb{E}[y]^T, \qquad \Sigma = USU^T, \qquad A=U, \;\; B=U^TW\]
  • LORD (Kaushal et al., 2023)
  • Our contributions:
    • Generalize to non-linear layers
      • Linear algebra \(\rightarrow\) Feature Distillation
    • Tunable compromise between local and global optimization
      • Local \(\rightarrow\) Flexible semi-global
    • Improved distillation
      • Teacher-only \(\rightarrow\) Teacher & Student supervision
    • Low-cost algo: bottom-first compression
  • Contribution: a better compromise between teacher and student inputs

  • Evidence: deeper layers are more robust to compression:

  • Bottom-first compression:
    • Low memory requirements:
      • Compress layers one by one
      • No backprop
    • Low computational cost & sample-efficient:
      • Partial forward pass
      • SVD initialization reduces data requirements
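
Here is the code sketch referenced above: a minimal, hypothetical implementation of the linear-layer solution (eigendecomposition of the output covariance), not the exact NAACL’25 pipeline; the shapes and the calibration batch `X` are illustrative.

```python
import torch

def low_rank_factorize(W: torch.Tensor, X: torch.Tensor, k: int):
    """Factor W (d_out x d_in) as A @ B with rank k, minimizing the output
    reconstruction error over a calibration batch X (n x d_in), following
    Sigma = E[y y^T] - E[y] E[y]^T = U S U^T,  A = U_k,  B = U_k^T W."""
    Y = X @ W.T                                   # outputs y = W x
    Yc = Y - Y.mean(dim=0, keepdim=True)
    Sigma = Yc.T @ Yc / len(Y)                    # output covariance (d_out x d_out)
    eigvals, U = torch.linalg.eigh(Sigma)         # eigenvalues in ascending order
    U_k = U[:, -k:]                               # top-k eigendirections
    return U_k, U_k.T @ W                         # A (d_out x k), B (k x d_in)

# Replacing one Linear layer by two smaller ones pays off when
# k * (d_in + d_out) < d_in * d_out.
W = torch.randn(1024, 4096)
X = torch.randn(256, 4096) @ torch.randn(4096, 64) @ torch.randn(64, 4096)  # low-rank activations
A, B = low_rank_factorize(W, X, k=128)
err = ((X @ W.T) - (X @ B.T @ A.T)).norm() / (X @ W.T).norm()
print(f"relative output reconstruction error: {err:.3f}")  # small, because X is low rank
```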

Growing nets

Adding dimensions one by one during training (a minimal growth step is sketched after the list below)

  • Progressive nets, dynamic architectures, NAS
  • Theorem: Growing during training leads to flatter minima
  • Intuition of the proof:
    • Consider the full parameter space \(\Omega\) frozen, except a hyperplane \(H\) (the initial net)
    • Derive the volume \(V\) of the basins of attraction around minima \(\theta^*\) in \(\Omega\): flatter \(\theta^* \implies\) larger \(V\)
    • Derive the probability that \(H\) intersects \(V\): larger \(V \implies\) higher probability
    • Lemma: SGD is more likely to converge towards a large \(V\)
    • Prove that a larger \(V \implies\) a smaller Hessian (flatter minimum)
  • Experimental results on CIFAR-100 with ResNet:
    • \(\lambda_{\text{max}}\) decreases from 800 to 775
    • Test accuracy is stable: from 60.6% to 60.9%
  • Contributes to the debate about flatness vs. generalization:
    • More and more evidence points towards decoupling flatness and generalization
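
As referenced above, a minimal hypothetical growth step for a two-layer MLP (the growth schedule, training loop and optimizer re-creation are omitted).

```python
import torch
import torch.nn as nn

def grow_hidden_layer(fc1: nn.Linear, fc2: nn.Linear):
    """Add one hidden unit between fc1 and fc2 without changing the function.

    The new unit's outgoing weight starts at zero, so training resumes from the
    same point but in a larger space (the frozen space Omega vs. hyperplane H
    of the proof sketch)."""
    new_fc1 = nn.Linear(fc1.in_features, fc1.out_features + 1)
    new_fc2 = nn.Linear(fc2.in_features + 1, fc2.out_features)
    with torch.no_grad():
        new_fc1.weight[:-1].copy_(fc1.weight)   # keep old incoming weights
        new_fc1.bias[:-1].copy_(fc1.bias)       # (the new unit keeps its random init)
        new_fc2.weight[:, :-1].copy_(fc2.weight)
        new_fc2.weight[:, -1].zero_()           # zero outgoing weight: same output
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Typical use: every few epochs, grow the layer and rebuild the optimizer.
fc1, fc2 = nn.Linear(32, 8), nn.Linear(8, 10)
fc1, fc2 = grow_hidden_layer(fc1, fc2)
```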

Parameter Efficient Finetuning

  • Additive PEFT methods increase the model’s dimensionality, but LoRA, sparseFT, … do not
  • Most PEFT methods reduce costs:
    • they do not store the LLM’s gradients
    • they do not store the optimizer’s momentum
    • but they still require backprop through the LLM (unlike ladder side-tuning, sketched below)!
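
By contrast, a minimal ladder-side-tuning-style sketch (a toy under assumptions, not the exact setup studied below): hidden states of a frozen backbone are detached, so no gradient ever flows back through it.

```python
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    """Small trainable 'ladder' reading detached hidden states of a frozen backbone."""
    def __init__(self, hidden_dim, side_dim, num_taps, num_classes):
        super().__init__()
        self.downs = nn.ModuleList(nn.Linear(hidden_dim, side_dim) for _ in range(num_taps))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, hidden_states):
        h = 0
        for down, state in zip(self.downs, hidden_states):
            h = h + down(state.detach())     # detached: no gradient into the backbone
        return self.head(h.mean(dim=1))      # mean-pool over sequence positions

# Toy frozen "backbone" standing in for an LLM whose layers we tap.
backbone = nn.ModuleList(nn.Linear(64, 64) for _ in range(4)).requires_grad_(False)
side = SideNetwork(hidden_dim=64, side_dim=16, num_taps=4, num_classes=2)
optimizer = torch.optim.AdamW(side.parameters(), lr=1e-3)

x = torch.randn(8, 10, 64)                   # (batch, sequence, hidden)
hidden_states, h = [], x
with torch.no_grad():                        # the backbone is only run forward
    for layer in backbone:
        h = torch.relu(layer(h))
        hidden_states.append(h)
logits = side(hidden_states)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()                              # gradients exist only in the side network
optimizer.step()
```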

  • Our contribution: scaling laws for Ladder Side Tuning (LST)
  • test loss scaling law:

\[ \mathcal{L}(C) = \frac{546}{C^{0.26}} + 0.21\]
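
As a readability aid, a tiny transcription of this fitted law (the constants are taken from the formula above; the unit of \(C\) is the one used in the original fit and is not restated here).

```python
# Fitted test-loss scaling law from above: L(C) = 546 / C**0.26 + 0.21,
# where 0.21 acts as the irreducible loss as compute C grows.
def predicted_test_loss(C: float) -> float:
    return 546.0 / C ** 0.26 + 0.21

for C in (1e6, 1e9, 1e12):  # C in the (unspecified here) compute units of the fit
    print(f"C = {C:.0e} -> predicted test loss = {predicted_test_loss(C):.2f}")
```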

  • downstream accuracy scaling law:

  • memory scaling law:

Unsupervised risk training

  • So far, we have optimized the empirical loss landscape
  • What we really want is the population risk \(R(\theta) = \int_{X,Y} L(f_{\theta}(x),y)\,dx\,dy\)
  • Assume unlabeled training data and 2 classes
  • Assume the class-conditional distribution of the classification score is Gaussian:

\[f_{\theta}(x)\mid y \;\sim\; \mathcal{N}(\mu_y, \sigma_y)\]

  • 10k epochs of training on the WDBC dataset (from PMLB)

  • We derive:

\[\begin{aligned} R(\theta) \simeq{} & \frac{p(y=0)}{2}\,(1+\mu_0)\left(1-\operatorname{erf}\left(\frac{-1-\mu_0}{\sigma_0\sqrt 2}\right)\right) + p(y=0)\,\sigma_0^2\,\mathcal{N}(-1;\mu_0,\sigma_0) \\ & + \frac{p(y=1)}{2}\,(1-\mu_1)\left(1+\operatorname{erf}\left(\frac{1-\mu_1}{\sigma_1\sqrt 2}\right)\right) + p(y=1)\,\sigma_1^2\,\mathcal{N}(1;\mu_1,\sigma_1) \end{aligned}\]

  • This closed-form risk estimate (transcribed into code below) can be used to:
    • train without supervised labels, when the class priors are known
    • post-train a DNN to improve generalization
    • regularize the empirical risk
    • select models without a validation corpus, replacing early stopping
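
A direct, hypothetical transcription of the closed-form risk above; the example values of \(\mu_y, \sigma_y\) and the priors are made up, and in practice \(\mu_y, \sigma_y\) would be estimated from the unlabeled scores (e.g. with a two-component Gaussian mixture).

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def closed_form_risk(mu0, sigma0, mu1, sigma1, p0, p1):
    """Population-risk estimate from the class-conditional Gaussians of the
    classification score, transcribing the formula above (reference points -1 and +1)."""
    r0 = (p0 / 2) * (1 + mu0) * (1 - math.erf((-1 - mu0) / (sigma0 * math.sqrt(2)))) \
        + p0 * sigma0 ** 2 * gaussian_pdf(-1, mu0, sigma0)
    r1 = (p1 / 2) * (1 - mu1) * (1 + math.erf((1 - mu1) / (sigma1 * math.sqrt(2)))) \
        + p1 * sigma1 ** 2 * gaussian_pdf(1, mu1, sigma1)
    return r0 + r1

# Made-up Gaussian parameters for the two score distributions, equal priors.
print(closed_form_risk(mu0=-0.8, sigma0=0.3, mu1=0.9, sigma1=0.2, p0=0.5, p1=0.5))
```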

MetaEval Offensive talk detection:

Thank you!

cerisara@loria.fr

https://ia.loria.fr/