Promoting fairness for under-represented languages in multilingual LLMs (2025-2026)

This project is funded by the University of Washington (UW) Global Innovation Fund (GIF) Research award for a UPCité/UW collaboration to promote fairness for under-represented languages in multilingual LLMs.

PIs for the University of Washington (UW): Gina-Anne Levow and Richard Wright

PI for Université Paris Cité (UPCité): Nicolas Ballier

Of the roughly 7,000 languages in the world, only a handful are well represented in language technologies such as LLMs. Among those, English is vastly over-represented, and only a few dialects are represented. This sampling imbalance is especially acute for spoken language models such as Whisper (Radford et al., 2023), where English alone accounts for more than ¾ of the training data. Under-representation in models leads to poor performance, leaving recent innovations in generative AI out of reach for most of the world’s population. Our collaboration will measure and minimize the training-data bias in multilingual language models such as ChatGPT (OpenAI, 2023), Whisper (Radford et al., 2023), and XLM-R (Conneau et al., 2019). We want to understand how the lack of training data degrades the representations learned for lower-resource languages, and we seek to mitigate these harms through improved representations that enhance both task accuracy and model efficiency.

This first event focuses on audio large language models and highlights the role of speech tokenisers. We develop our case studies on Whisper and Wav2vec. A second event, entitled “Speech and text LLMs revisited: under the hood investigation of multilingual models”, will be held in person at UW in fall 2025, with a mixed audience of linguists, engineers and computational scientists.

FIRST EVENT: Paris

Zoom session: 26 May 2025

Zoom link: https://u-paris.zoom.us/j/86356943682?pwd=llaCqKu8aOajFauHBU36X6DDMddZcb.1
Password: 542839

Room: 720 (seventh floor)

Bâtiment Olympe de Gouges

8 Place Paul Ricoeur

75013 PARIS

Access to the Olympe de Gouges building

This workshop is intended to discuss various approaches to probing audio LLMs such as Whisper (Radford et al., 2023) or Wav2vec (Baevski et al., 2020).
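
As an illustration of what such probing can look like in practice, the sketch below (not taken from the workshop materials; the checkpoint name and audio file are placeholders) extracts frame-level Whisper encoder representations with the Hugging Face transformers library. Probing classifiers can then be trained on these vectors.

# Minimal sketch: extract Whisper encoder hidden states for one audio file.
# Assumes the `transformers` and `librosa` packages; "openai/whisper-small"
# and "example.wav" are placeholders, not the workshop's own setup.
import torch
import librosa
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperModel.from_pretrained("openai/whisper-small")
model.eval()

# Whisper expects 16 kHz mono audio.
audio, sr = librosa.load("example.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    encoder_out = model.encoder(inputs.input_features)

# Frame-level representations on which probing classifiers can be trained.
hidden_states = encoder_out.last_hidden_state  # shape: (1, n_frames, d_model)
print(hidden_states.shape)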

FINAL PROGRAMME

(All times are Paris time, CEST; when it is 16h in Paris, it is 7h in Seattle.)

16h Nicolas Ballier (UPCité), Gina-Anne Levow & Richard Wright (UW): Introduction
We briefly present ongoing research on speech language models.

PART 1: METHODS

16h05 Guillaume Wisniewski (UPCité) Measuring distance with wav2vec & Whisper

16h20 Taylor Arnold (Richmond) Analyzing Whisper encoder outputs with UNICODE

16h30 Sara Ng (Western Washington University) Realigning Whisper output with human transcriptions

16h40 James Tanner (Glasgow) Analyzing Stops / VOT with Wav2vec

17h00 Discussion (coffee)

PART 2: CASE STUDIES

17h30 Behnoosh Namdarzadeh (UPCité) Whisper transcriptions for lesser-resourced languages: the case of Persian

17h45 Katelyn Mei (UW) & Anna Choi (Cornell) Careless Whisper

18h15 Gita Dhungana (UW) Nepali

18h30 Discussion (coffee)

19h00 Siyu Liang (UW) Analyzing Whisper decoder outputs across diverse language resource levels

19h20 Final discussion

19h50 Closing remarks

20h End

SOME RESOURCES:

Ballier, N., Burin, L., Namdarzadeh, B., Ng, S., Wright, R., & Yunès, J. B. (2024, October). Probing Whisper predictions for French, English and Persian transcriptions. In Proceedings of the 7th International Conference on Natural Language and Speech Processing (Vol. 7, pp. 129-138). <hal-04547597>

Ballier, N., Arnold, T., Méli, A., Fullerton, T., & Yunès, J. B. (2024). Whisper for L2 speech scoring. International Journal of Speech Technology, 27, 923-934. 

Koenecke, A., Choi, A. S. G., Mei, K. X., Schellmann, H., & Sloane, M. (2024). Careless Whisper: Speech-to-text hallucination harms. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 1672-1681.

Namdarzadeh, B., & Ballier, N. (2024, draft). Audio LLM subtokens as encapsulated “knowledge”: the case of Persian subtoken graphemic representations in Whisper. In Grapholinguistics in the 21st century—From graphemes to knowledge.

Some links to scripts:

The customised C++ version of Whisper we used for encoder outputs is available here: https://github.com/jbyunes/whisper.cpp

The script aligning Whisper encoder predictions with the final transcripts is available at https://github.com/statsmaths/paper-replication

The multilingual and English-only Whisper subtoken dictionaries are available as .json files here: https://github.com/nballier/Whisper
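
As a complement to these dictionaries, a minimal sketch of the same kind of subtoken inspection (assuming the Hugging Face WhisperTokenizer rather than the repository's .json files) is to count how many entries of the multilingual Whisper vocabulary contain Arabic-script characters, the script used to write Persian:

# Minimal sketch, assuming the Hugging Face WhisperTokenizer rather than the
# repository's .json dictionaries: count multilingual Whisper subtokens that
# contain at least one Arabic-script character (the script used for Persian).
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-small")

def has_arabic_script(s: str) -> bool:
    # Arabic, Arabic Supplement and Arabic Extended-A Unicode blocks.
    return any(
        0x0600 <= ord(ch) <= 0x06FF
        or 0x0750 <= ord(ch) <= 0x077F
        or 0x08A0 <= ord(ch) <= 0x08FF
        for ch in s
    )

vocab = tok.get_vocab()  # maps byte-level BPE subtoken strings to ids
# Decode each subtoken before testing; byte-level pieces that split a
# character across tokens will not decode to Arabic letters and are skipped.
arabic = [t for t in vocab if has_arabic_script(tok.convert_tokens_to_string([t]))]
print(f"{len(arabic)} of {len(vocab)} subtokens contain Arabic-script characters")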