LLM4All

Welcome to LLM4All

Open-source LLMs are slowly catching up with ChatGPT. Although they still lag behind in several key aspects (performance across a wide range of tasks, context length...), we expect the gap between open-source and closed-source LLMs to keep shrinking in the near future. Open-source LLMs are also key enabling technologies for research (theorizing about training algorithms requires complete control over them), companies (privacy, technology ownership), governments (sovereignty and dependence) and the large community of individual practitioners who have already enriched the open-source LLM landscape at an unprecedented pace.

In this project, we focus on open-source LLMs that:

Everyone, preferably with some GPU resources, can deploy on their own computers and fully control.
Will be fine-tuned to better handle human meetings and conversations (but no chatbots!), especially in French.
Will be incrementally updated with the latest news, emerging lexicon and events.
Will be connected to the best speech recognition models (Whisper, MMS...), in particular to handle emergency calls in hospitals.

Funding

LLM4All is a project funded by the French ANR (Agence Nationale de la Recherche).

Consortium

The consortium is composed of, in alphabetical order:

The AP-HP hospitals in Paris
The Linagora company in Paris, focused on open-source solutions for language
The LIX laboratory in Paris:
- The DaSciM team specialized in data analytics and machine learning
The LORIA laboratory in Nancy:
- The Multispeech team focused on speech, audio and multimodal signal processing
- The Synalp team (leader of LLM4All) specialized in Natural Language Processing

with strong support from the Hugging Face company on LLM training.

Planning

Start date: Oct 1st, 2023
Duration: 42 months

Companion projects

Work packages

Nb	Leader	Name
WP0	LORIA	Project management
WP1	LORIA	Fine-tuning, continual updating
WP2	LIX	Low-cost LLMs
WP3	Linagora	LLMs for spoken dialogue
WP4	AP-HP	Boosting LLMs with other data
WP5	Linagora	Communication, dissemination, exploitation

Contact

cerisara at loria dot fr

Deliverables

T0+6 = 1st April 2024
T0+12 = 1st October 2024

D	date	desc
	1/4/24	DMP
	1/10/24	Consortium agreement
	1/4/25	Interim report at 18 months
	1/10/25	DMP at 18 months
	31/3/27	Final report
	31/3/27	Final DMP

D	month	desc
0.1	18	progress report v1
0.1	40	progress report v2
0.2	6	DMP v1
0.2	42	DMP v2
1.1	12	software LLM training and evaluation + report
1.1	24	software LLM training and evaluation + report
1.1	40	software LLM training and evaluation + report
1.2	24	model + release every 2 months
2.1	12	software LLM low-cost inference, training + report
2.1	24	software LLM low-cost inference, training + report
2.1	40	software LLM low-cost inference, training + report
2.2	24	distilled version of model from WP1
2.2	40	distilled version of model from WP1
3.1	12	augmented dialogue dataset
3.1	24	augmented dialogue dataset
3.2	12	software + report: adaptation of LLM to dialogue + dialogue summarization
3.2	24	software + report: adaptation of LLM to dialogue + dialogue summarization
3.2	40	software + report: adaptation of LLM to dialogue + dialogue summarization
3.3	24	model for dialogue and dialogue summarization
3.3	40	model for dialogue and dialogue summarization
4.1	12	SimSAMU dataset
4.1	24	SimSAMU dataset
4.1	40	SimSAMU dataset
4.2	12	software + report: ASR domain adaptation
4.2	24	software + report: ASR domain adaptation
4.2	40	software + report: ASR domain adaptation
4.3	24	ASR models for meetings and ER calls
4.3	40	ASR models for meetings and ER calls
4.4	12	software + report: adaptation to ER calls
4.4	24	software + report: adaptation to ER calls
4.4	40	software + report: adaptation to ER calls
4.5	24	LLM for ER calls
4.5	40	LLM for ER calls
5.1	30	Workshop
5.2	42	dissemination report
5.3	12	exploitation plan
5.3	36	exploitation plan

PMT meetings

Our PMT meetings take place at 2 PM on the first Thursday of every month at https://jitsi.linagora.com/llm4all

2nd November 2023
- Discussion about progress in fine-tuning (FT) and continual learning (CL) of LLMs
- Data Management Plan: the DMP is available online from the top menu
- TODO everyone (by mid-January): complete a first version by editing this markdown file or by sending me your updates by email
7th December 2023
- Discussion about progress in finetuning Claire
11th January 2024
- Progress per partner
- ANR video conference on Feb 2nd with all projects; I'll present LLM4All
- Organizing a workshop with the 3 other ANR TSIA projects
- List corpora in the DMP...
1st February 2024
- Discussion about progress
7th March 2024
- consortium agreement
- DMP
- logo
- Workshop 25/04: call for partners' contributions 18/03
4th April 2024

Workshop

Kickoff meeting: 11th October 2023 at Linagora's offices, Paris
ANR Workshop: February 2nd, 2024
- Slides
ANR Workshop: April 25th, 2024
- Slides
- Poster
