Welcome to LLM4All

Open-source LLMs are slowly catching up with ChatGPT. Although they still lag behind in several key respects (performance across a wide range of tasks, context length...), we expect the gap between open-source and closed-source LLMs to keep shrinking in the near future. Open-source LLMs are also key enabling technologies for research (theorizing about training algorithms requires complete control over them), companies (privacy, technology ownership), governments (sovereignty and dependence) and the large community of individual practitioners who have already enriched the open-source LLM landscape at an unprecedented pace.

In this project, we focus on open-source LLMs that:

  • Anyone, preferably with some GPU resources, can deploy on their own computers and fully control.
  • Will be finetuned to better handle human meetings and conversations (but no chatbots!), especially in French.
  • Will be incrementally updated with the latest news, emerging vocabulary and events.
  • Will be connected to the best speech recognition models (Whisper, MMS...), in particular to handle emergency calls in hospitals.

Funding

LLM4All is a project funded by the French ANR (Agence Nationale de la Recherche).

Consortium

The consortium is composed of, in alphabetical order:

with strong support from the Hugging Face company on LLM training.

Planning

  • Start date: Oct 1st, 2023
  • Duration: 42 months

Companion projects

Work packages

Nb   Leader    Name
WP0  LORIA     Project management
WP1  LORIA     Fine-tuning, continual updating
WP2  LIX       Low-cost LLMs
WP3  Linagora  LLMs for spoken dialogue
WP4  AP-HP     Boosting LLMs with other data
WP5  Linagora  Communication, dissemination, exploitation

Contact

cerisara at loria dot fr


Deliverables

  • T0+6 = 1st April 2024
  • T0+12 = 1st October 2024
D date desc
1/4/24 DMP
1/10/24 Consortium agreement
1/4/25 18-month interim report
1/10/25 18-month DMP
31/3/27 Final report
31/3/27 Final DMP
0.1 18 progress report v1
0.1 40 progress report v2
0.2 6 DMP v1
0.2 42 DMP v2
1.1 12 software LLM training and evaluation + report
1.1 24 software LLM training and evaluation + report
1.1 40 software LLM training and evaluation + report
1.2 24 model + release every 2 months
2.1 12 software LLM low-cost inference, training + report
2.1 24 software LLM low-cost inference, training + report
2.1 40 software LLM low-cost inference, training + report
2.2 24 distilled version of model from WP1
2.2 40 distilled version of model from WP1
3.1 12 augmented dialogue dataset
3.1 24 augmented dialogue dataset
3.2 12 soft + report: adaptation of LLM to dialogue + dialogue summarization
3.2 24 soft + report: adaptation of LLM to dialogue + dialogue summarization
3.2 40 soft + report: adaptation of LLM to dialogue + dialogue summarization
3.3 24 model for dialogue and dialogue summarization
3.3 40 model for dialogue and dialogue summarization
4.1 12 SimSAMU dataset
4.1 24 SimSAMU dataset
4.1 40 SimSAMU dataset
4.2 12 soft + report: ASR domain adaptation
4.2 24 soft + report: ASR domain adaptation
4.2 40 soft + report: ASR domain adaptation
4.3 24 ASR models for meetings and ER calls
4.3 40 ASR models for meetings and ER calls
4.4 12 soft + report: adaptation to ER calls
4.4 24 soft + report: adaptation to ER calls
4.4 40 soft + report: adaptation to ER calls
4.5 24 LLM for ER calls
4.5 40 LLM for ER calls
5.1 30 Workshop
5.2 42 dissemination report
5.3 12 exploitation plan
5.3 36 exploitation plan

PMT meetings

Our PMT meetings take place at 2 PM on the first Thursday of every month, at https://jitsi.linagora.com/llm4all

  • 2nd November 2023
    • Discussion about progress in FT and CL of LLMs
    • Data Management Plan: the DMP is online here or from the top menu
    • TODO for everyone (by mid-January): complete a first version by editing this markdown file or by sending me your updates by email
  • 7th December 2023
    • Discussion about progress in finetuning Claire
  • 11th January 2024
    • Progress per partner
    • ANR video conference on Feb 2nd with all projects; I'll present LLM4All
    • Organizing a workshop with the 3 other ANR TSIA projects
    • List the corpora in the DMP...
  • 1st February 2024
    • Discussion about progress
  • 7th March 2024
    • consortium agreement
    • DMP
    • logo
    • Workshop 25/04: call for partners' contributions 18/03
  • 4th April 2024

Workshop