Welcome to LLM4All
Open-source LLMs are slowly catching up with ChatGPT. They still lag behind in several key aspects (performance across a wide range of tasks, context length...), but we expect the gap between open-source and closed-source LLMs to keep shrinking in the near future. Open-source LLMs are also key enabling technologies for research (developing theory for training algorithms requires complete control over them), companies (privacy, technology ownership), governments (sovereignty and dependence) and the large community of individual practitioners who have already enriched the open-source LLM landscape at an unprecedented pace.
In this project, we focus on open-source LLMs that:
- Everyone, preferably with some GPU resources, can deploy on their own computers and fully control.
- Will be fine-tuned to better handle human meetings and conversations (but no chatbots!), especially in French.
- Will be incrementally updated with the latest news, emerging vocabulary and events.
- Will be connected to the best speech recognition models (Whisper, MMS...), in particular to handle emergency calls in hospitals.
Funding
LLM4All is a project funded by the French ANR (Agence Nationale de la Recherche).
Consortium
The consortium is composed of, in alphabetical order:
- The AP-HP hospitals in Paris
- The Linagora company in Paris, focused on open-source solutions for language
- The LIX laboratory in Paris:
  - The DaSciM team, specialized in data analytics and machine learning
- The LORIA laboratory in Nancy:
  - The Multispeech team, focused on speech, audio and multimodal signal processing
  - The Synalp team (leader of LLM4All), specialized in Natural Language Processing
with strong support from the Hugging Face company on LLM training.
Planning
- Start date: Oct 1st, 2023
- Duration: 42 months
Companion projects
Work packages
Nb | Leader | Name |
---|---|---|
WP0 | LORIA | Project management |
WP1 | LORIA | Fine-tuning, continual updating |
WP2 | LIX | Low-cost LLMs |
WP3 | Linagora | LLMs for spoken dialogue |
WP4 | AP-HP | Boosting LLMs with other data |
WP5 | Linagora | Communication, dissemination, exploitation |
Contact
cerisara at loria dot fr
Deliverables
- T0+6 = 1st April 2024
- T0+12 = 1st October 2024
Deliverable | Date (calendar date, or months after T0) | Description |
---|---|---|
| | 1/4/24 | DMP |
| | 1/10/24 | Consortium agreement |
| | 1/4/25 | 18-month interim report |
| | 1/10/25 | 18-month DMP |
| | 31/3/27 | Final report |
| | 31/3/27 | Final DMP |
0.1 | 18 | progress report v1 |
0.1 | 40 | progress report v2 |
0.2 | 6 | DMP v1 |
0.2 | 42 | DMP v2 |
1.1 | 12 | software LLM training and evaluation + report |
1.1 | 24 | software LLM training and evaluation + report |
1.1 | 40 | software LLM training and evaluation + report |
1.2 | 24 | model + release every 2 months |
2.1 | 12 | software LLM low-cost inference, training + report |
2.1 | 24 | software LLM low-cost inference, training + report |
2.1 | 40 | software LLM low-cost inference, training + report |
2.2 | 24 | distilled version of model from WP1 |
2.2 | 40 | distilled version of model from WP1 |
3.1 | 12 | augmented dialogue dataset |
3.1 | 24 | augmented dialogue dataset |
3.2 | 12 | soft + report: adaptation of LLM to dialogue + dialogue summarization |
3.2 | 24 | soft + report: adaptation of LLM to dialogue + dialogue summarization |
3.2 | 40 | soft + report: adaptation of LLM to dialogue + dialogue summarization |
3.3 | 24 | model for dialogue and dialogue summarization |
3.3 | 40 | model for dialogue and dialogue summarization |
4.1 | 12 | SimSAMU dataset |
4.1 | 24 | SimSAMU dataset |
4.1 | 40 | SimSAMU dataset |
4.2 | 12 | soft + report: ASR domain adaptation |
4.2 | 24 | soft + report: ASR domain adaptation |
4.2 | 40 | soft + report: ASR domain adaptation |
4.3 | 24 | ASR models for meetings and ER calls |
4.3 | 40 | ASR models for meetings and ER calls |
4.4 | 12 | soft + report: adaptation to ER calls |
4.4 | 24 | soft + report: adaptation to ER calls |
4.4 | 40 | soft + report: adaptation to ER calls |
4.5 | 24 | LLM for ER calls |
4.5 | 40 | LLM for ER calls |
5.1 | 30 | Workshop |
5.2 | 42 | dissemination report |
5.3 | 12 | exploitation plan |
5.3 | 36 | exploitation plan |
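Most rows of the deliverables table give their due date as a number of months after T0 (1st October 2023, the project start date). As a convenience, the conversion to a calendar date can be sketched as below. This is only an illustrative helper: the names `T0` and `due_date` are hypothetical and not part of any project software, and the function assumes the start date falls on the 1st of a month.

```python
from datetime import date

# Project start date, taken from the Planning section.
T0 = date(2023, 10, 1)


def due_date(months_after_t0: int, start: date = T0) -> date:
    """Return the calendar date `months_after_t0` months after `start`.

    Assumes `start` is the 1st of a month, so the day never needs clamping.
    """
    # Count months since year 0, add the offset, then split back into year/month.
    total_months = start.year * 12 + (start.month - 1) + months_after_t0
    year, month_index = divmod(total_months, 12)
    return date(year, month_index + 1, start.day)


if __name__ == "__main__":
    for offset in (6, 12, 18, 24, 42):
        print(f"T0+{offset} = {due_date(offset).isoformat()}")
```

With this convention, T0+6 and T0+12 match the two dates listed at the top of the Deliverables section. T0+42 falls on 1st April 2027, one day after the 31/3/27 listed for the final report and final DMP.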
PMT meetings
Our PMT meetings take place at 2 PM on the first Thursday of every month, at https://jitsi.linagora.com/llm4all
- 2nd November 2023
  - Discussion about progress in FT and CL of LLMs
  - Data Management Plan: the DMP is available online here or from the top menu
  - TODO for everyone (by mid-January): complete a first version by editing this markdown file or by sending me your updates by email
- 7th December 2023
  - Discussion about progress in fine-tuning Claire
- 11th January 2024
  - Progress per partner
  - ANR video conference on 2nd February with all projects; I'll present LLM4All
  - Organizing a workshop with the 3 other ANR TSIA projects
  - List the corpora in the DMP...
- 1st February 2024
  - Discussion about progress
- 7th March 2024
  - Consortium agreement
  - DMP
  - Logo
  - Workshop 25/04: call for partners' contributions 18/03
- 4th April 2024
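The "first Thursday of every month" rule can be turned into concrete dates with a few lines of Python. This is a minimal sketch using only the standard library; the function name `first_thursday` is hypothetical. The dates listed above take precedence over the rule: the January 2024 meeting was held on the 11th, one week after the date the rule gives.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() counts Monday as 0, so Thursday is 3


def first_thursday(year: int, month: int) -> date:
    """Return the first Thursday of the given month."""
    first_day = date(year, month, 1)
    # Days to move forward from the 1st to reach a Thursday (0 if already one).
    offset = (THURSDAY - first_day.weekday()) % 7
    return first_day + timedelta(days=offset)


if __name__ == "__main__":
    for year, month in [(2023, 11), (2023, 12), (2024, 2), (2024, 3), (2024, 4)]:
        print(first_thursday(year, month).isoformat())
```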
Workshop
- Kickoff meeting: 11th October 2023 at Linagora's offices, Paris
- ANR Workshop: 2nd February 2024
- ANR Workshop: 25th April 2024