Cost is Key: towards accessible LLMs

Christophe Cerisara (CNRS)

News: Mille-Pensées

Contributors: Gabriel Lauzzana (main), Imane Ouada

  • A good 7B LLM that reasons in French
  • A better math post-training pipeline than Qwen2.5-Maths
  • Trained on translated math CoT (50%) and English math CoT (50%): 1.6B tokens

Available on HuggingFace: https://huggingface.co/GLauzza/Mille-Pensees

  • Training dataset

  • Reasons in French, which improves its performance on Maths-FR
  • It also improves performance on Maths-EN!
  • Does not forget (much of) its generic English capabilities

The quest for lowering costs

  • Scaling laws teach us that more compute always improves performance

Revised Chinchilla: \(L = 1.82 + \frac {514}{N^{0.35}} + \frac {2115.2}{D^{0.37}}\)
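As a quick illustration, the revised Chinchilla fit above can be evaluated directly. A minimal sketch, with \(N\) in parameters and \(D\) in tokens; the function name and the example budgets are illustrative:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Revised Chinchilla fit: L = 1.82 + 514/N^0.35 + 2115.2/D^0.37."""
    return 1.82 + 514.0 / n_params ** 0.35 + 2115.2 / n_tokens ** 0.37

# More parameters or more tokens always lowers the predicted loss:
small = chinchilla_loss(7e9, 1.6e9)    # a 7B model on 1.6B tokens
large = chinchilla_loss(70e9, 1.4e12)  # a Chinchilla-like budget
assert large < small
```

Both additive terms are strictly decreasing in \(N\) and \(D\), which is exactly the "more compute always improves performance" claim.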

Test-time compute scaling laws:

  • But performance is only one side of the coin: costs must be minimized for usability, ecology…
  • Costs should become a metric as important as performance

HF leaderboard: AIEnergyScore

comparia.beta.gouv.fr

Capability density:

\(\rho (M) = \frac {f^{-1}(S)}{N}\), where \(f\) is the downstream scaling law

Exponential growth of capability density

Counter-intuitively, compression often lowers density
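Reading the density formula \(f^{-1}(S)/N\) concretely: the sketch below uses a made-up saturating power law for \(f\) (the real downstream scaling law is fitted empirically; every constant and name here is illustrative):

```python
# Hypothetical downstream scaling law f: score as a saturating function
# of (effective) parameter count. The constants are made up for
# illustration; the real f is fitted to benchmark results.
def f(n_params: float) -> float:
    return 1.0 - 1.5 / n_params ** 0.2

def f_inv(score: float) -> float:
    """Effective parameter count needed to reach a given score."""
    return (1.5 / (1.0 - score)) ** 5.0  # inverse of f (1/0.2 = 5)

def capability_density(score: float, n_params: float) -> float:
    """Density = f^{-1}(S) / N: effective size per actual parameter."""
    return f_inv(score) / n_params

# A model sitting exactly on f has density 1; reaching the same score
# with half the parameters doubles the density:
n = 7e9
assert abs(capability_density(f(n), n) - 1.0) < 1e-6
assert abs(capability_density(f(n), n / 2) - 2.0) < 1e-6
```

In these terms, a compressed model lowers density whenever its score drops faster than \(f^{-1}\) shrinks relative to the reduced \(N\).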

Our contribution: back-propagation free adaptation

Main contributor: Estelle Zheng

  • Training an LLM with \(N\) parameters:

               FLOPs per token   Memory
    Forward    \(O(2N)\)         \(N/2\)
    Backward   \(O(4N)\)         \(11N\)
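The table above can be turned into a rough per-token cost estimate. A sketch under the table's own units (memory in multiples of the parameter count \(N\)); the function name is illustrative:

```python
def training_cost_per_token(n_params: float) -> dict:
    """Rough per-token training cost from the forward/backward table.

    FLOPs: ~2N (forward) + ~4N (backward) = ~6N per token.
    Memory: ~N/2 for a forward-only pass vs ~11N for full training
    (units: multiples of the parameter count N, as in the table).
    """
    return {
        "forward_flops": 2 * n_params,
        "backward_flops": 4 * n_params,
        "total_flops": 6 * n_params,
        "forward_memory": n_params / 2,
        "training_memory": 11 * n_params,
    }

cost = training_cost_per_token(7e9)
# Skipping the backward pass saves ~2/3 of the FLOPs and most of the memory:
assert cost["total_flops"] == 3 * cost["forward_flops"]
assert cost["training_memory"] == 22 * cost["forward_memory"]
```

This gap between forward-only and full-training cost is the motivation for back-propagation-free adaptation.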

Ladder side-tuning
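A minimal, framework-free sketch of the idea. Assumptions made loudly: the backbone is reduced to fixed scalar layers, the side path is a trainable weighted read-out of the backbone activations, and the update is a zeroth-order finite-difference step, one possible back-propagation-free option; the actual method in the talk may differ, and all names and constants are illustrative:

```python
def backbone(x):
    """Frozen backbone: returns the activation after each (fixed) layer."""
    acts, h = [], x
    for w in (0.9, 1.1, 0.8):  # frozen layer weights, never updated
        h = w * h
        acts.append(h)
    return acts

def side(acts, theta):
    """Trainable side path: a weighted read-out of backbone activations."""
    return sum(t * a for t, a in zip(theta, acts))

def loss(theta, data):
    return sum((side(backbone(x), theta) - y) ** 2 for x, y in data) / len(data)

# Fit the side weights on a toy target; the backbone is only ever run
# forward, so no gradient flows (or needs to flow) through it.
data = [(x, 2.0 * x) for x in (-1.0, 0.5, 1.0, 2.0)]
theta = [0.0, 0.0, 0.0]
lr, eps = 0.1, 1e-5
for _ in range(200):
    base = loss(theta, data)
    grad = []
    for i in range(len(theta)):
        bumped = theta[:]
        bumped[i] += eps
        grad.append((loss(bumped, data) - base) / eps)  # finite differences
    theta = [t - lr * g for t, g in zip(theta, grad)]

assert loss(theta, data) < 1e-3
```

Because only the tiny side path is updated, the \(O(4N)\) backward FLOPs and the \(11N\) training memory of the full backbone are never paid.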

Can ladder side-tuning scale?

Can ladder side-tuning reason?

Can x-ladder shorten CoT?

Thank you!

cerisara@loria.fr