Cost is Key: towards accessible LLMs
Christophe Cerisara (CNRS)
News: Mille-Pensées
Contributors: Gabriel Lauzzana (main), Imane Ouada
- A good 7B LLM that reasons in French
- A better maths post-training pipeline than Qwen2.5-Math
- Trained on translated maths CoT (50%) and EN maths CoT (50%): 1.6B tokens
Available on HuggingFace: https://huggingface.co/GLauzza/Mille-Pensees
- Reasons in FR, which improves its performance in Maths-FR:
- Also improves performance in Maths-EN!
- Does not forget (much) generic EN capabilities:
The quest for lowering costs
- Scaling laws teach us that more compute always improves performance

Revised Chinchilla: \(L = 1.82 + \frac {514}{N^{0.35}} + \frac {2115.2}{D^{0.37}}\)
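The fit above can be evaluated directly. A minimal sketch (the constants are taken from the slide; the helper name is mine):

```python
# Hedged sketch: evaluating the revised Chinchilla fit for a few
# (parameters N, training tokens D) budgets.

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted pre-training loss for N parameters and D training tokens."""
    return 1.82 + 514 / N**0.35 + 2115.2 / D**0.37

# More compute (larger N or D) always lowers the predicted loss,
# but with diminishing returns towards the irreducible 1.82 term:
for N, D in [(1e9, 20e9), (7e9, 140e9), (70e9, 1.4e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> L={chinchilla_loss(N, D):.3f}")
```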
Test-time compute scaling laws:
- But performance is only one side of the coin: costs must be minimized for usability, ecology…
- Costs should become a metric as important as performance
HF leaderboard: AIEnergyScore
comparia.beta.gouv.fr
Capability density:
\(p(M) = \frac {f^{-1}(S)}{N}\) with \(S\) the model's downstream score, \(N\) its parameter count, and \(f\) the downstream scaling law
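The density can be computed by numerically inverting \(f\). A sketch, assuming an illustrative saturating law for \(f\) (not a fitted one):

```python
# Capability density: rho(M) = f_inv(S) / N, i.e. "how many reference
# parameters would be needed to reach score S" divided by the actual size.
import math

def f(N: float) -> float:
    """Toy downstream scaling law: score grows with log-size, saturating at 1."""
    return 1.0 - math.exp(-0.1 * math.log10(N))

def f_inv(S: float, lo: float = 1e6, hi: float = 1e13) -> float:
    """Invert f by bisection in log-space (f is monotone increasing in N)."""
    mid = math.sqrt(lo * hi)
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if f(mid) < S:
            lo = mid
        else:
            hi = mid
    return mid

def density(N: float, S: float) -> float:
    return f_inv(S) / N

# A 1B model that scores as well as the law predicts for 3B has density ~3:
print(density(1e9, f(3e9)))
```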
Exponential growth of capability density
Counter-intuitively, compression often lowers density
Our contribution: back-propagation-free adaptation
Main contributor: Estelle Zheng
- Training an LLM with \(N\) parameters:

| Pass | Compute | Memory |
| --- | --- | --- |
| Forward | \(O(2N)\) | \(N/2\) |
| Backward | \(O(4N)\) | \(11N\) |
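These per-token costs give the usual back-of-the-envelope estimate: roughly \(6N\) FLOPs per training token with back-propagation, \(2N\) without it. A hedged sketch (the helper name is mine):

```python
# Back-of-the-envelope training cost from the table above: ~2N FLOPs per
# token for the forward pass, ~4N for the backward pass, so ~6N*D FLOPs
# to train on D tokens. Skipping back-propagation removes the 4N term.

def training_flops(N: float, D: float, backprop: bool = True) -> float:
    per_token = 2 * N + (4 * N if backprop else 0)
    return per_token * D

N, D = 7e9, 1.6e9  # e.g. a 7B model on a 1.6B-token mix
print(f"with backprop: {training_flops(N, D):.2e} FLOPs")
print(f"forward only:  {training_flops(N, D, backprop=False):.2e} FLOPs")
```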
Ladder side-tuning
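Ladder side-tuning trains only a lightweight side network fed by the frozen backbone's intermediate states, so no gradient ever flows through the backbone. A minimal sketch, with an illustrative toy backbone, task, and shapes (not the authors' code):

```python
import numpy as np

# Minimal ladder side-tuning sketch: the backbone is frozen, and a small
# trainable side readout consumes the backbone's intermediate states, so
# only the side parameters Ws are ever updated.

rng = np.random.default_rng(0)

# Frozen "backbone": two random layers (stand-in for a pretrained network).
W1 = rng.standard_normal((32, 8)) * 0.3
W2 = rng.standard_normal((32, 32)) * 0.3

def backbone(x):
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    return h1, h2  # the side network taps these intermediate states

# Trainable side readout over the concatenated taps.
Ws = np.zeros((1, 64))

def side_predict(x):
    h1, h2 = backbone(x)
    h = np.concatenate([h1, h2])
    return (Ws @ h).item(), h

# Toy regression target; only Ws gets its (analytic) gradient.
X = rng.standard_normal((200, 8))
y = X[:, 0] - 0.5 * X[:, 1]

def mse():
    return float(np.mean([(side_predict(xi)[0] - yi) ** 2 for xi, yi in zip(X, y)]))

mse_before = mse()
lr = 0.02
for _ in range(200):
    for xi, yi in zip(X, y):
        pred, h = side_predict(xi)
        Ws -= lr * (pred - yi) * h[None, :]  # dL/dWs for squared error
mse_after = mse()
print(f"MSE before: {mse_before:.3f}  after: {mse_after:.3f}")
```

Because the backbone is never differentiated, the expensive backward pass through the large model (the \(O(4N)\) term above) is avoided entirely.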

Does the ladder scale?

Can the ladder reason?
Can x-ladder shorten the CoT?

Thank you!
cerisara@loria.fr