PIs for University of Washington (UW): Gina-Anne Levow and Richard Wright

PI for Université Paris Cité (UPCité): Nicolas Ballier

Of the 7,000 languages in the world, only a handful are well-represented in language technologies such as LLMs. Among those, English is vastly over-represented, and only a few of its dialects are represented. This sampling imbalance is especially acute for spoken language models such as Whisper (Radford et al., 2023), where English alone accounts for more than three-quarters of the training data. Underrepresentation in models leads to poor performance, leaving recent innovations in generative AI out of reach for most of the world’s population. Our collaboration will measure and minimize the training-data bias in multilingual language models like ChatGPT (OpenAI, 2023), Whisper (Radford et al., 2023), and XLM-R (Conneau et al., 2019). We will examine how the scarcity of training data degrades the representations of lower-resource languages, and we seek to mitigate these harms through improved representations that enhance both task accuracy and model efficiency.

This first event focuses on audio large language models and highlights the role of speech tokenisers. We will develop our case studies on Whisper and Wav2vec. A second event will be held in person in fall 2025 at UW, entitled “Speech and text LLMs revisited: under the hood investigation of multilingual models”, with a mixed audience of linguists, engineers, and computational scientists.
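To illustrate what a speech tokeniser does, the sketch below quantises frames of continuous acoustic features into a sequence of discrete unit ids via k-means clustering, the idea underlying wav2vec-style discrete units. This is a minimal sketch on synthetic data, not the actual Whisper or Wav2vec pipeline; the function name and all parameters are illustrative.

```python
import numpy as np

def kmeans_tokenise(frames, n_units=4, n_iter=10, seed=0):
    """Quantise feature frames into discrete 'speech tokens' via k-means.

    Each frame of continuous acoustic features is replaced by the id of
    its nearest centroid, turning audio into a token sequence that a
    language model can consume.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = frames[rng.choice(len(frames), n_units, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest centroid (its token id).
        dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
        tokens = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for k in range(n_units):
            if (tokens == k).any():
                centroids[k] = frames[tokens == k].mean(axis=0)
    return tokens, centroids

# Synthetic "audio": 200 frames of 13-dim features drawn from 4 clusters.
rng = np.random.default_rng(1)
means = rng.normal(size=(4, 13)) * 5
frames = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 13)) for m in means])

tokens, _ = kmeans_tokenise(frames, n_units=4)
print(tokens[:10])  # a discrete unit sequence standing in for the audio
```

In practice, production tokenisers operate on learned representations (e.g. self-supervised encoder outputs) rather than raw features, but the discretisation step is the same in spirit.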