How does automatic speech recognition navigate languages
Published on Oct 2024
From virtual meeting assistants to call center platforms, the need for multilingual transcription to serve a global user base is on the rise. As a product builder looking to implement automatic speech recognition (ASR) into your app, you're likely searching for a solution that can transcribe speech in multiple languages in real time and translate it accurately into the language of choice.
However, not all speech recognition providers are equally fit for the task. Traditionally, systems have needed audio clips of at least 30 seconds to identify the language, and relied on limited APIs that support only a handful of languages. Factors such as background noise and accents can also prevent the language from being detected properly. Messages get lost in translation. Participants get frustrated.
In this article, we’ll take a look at how state-of-the-art speech recognition systems navigate the multilingual world, dive into the mechanics of language detection, explore the capabilities of models such as OpenAI Whisper, and discuss Gladia’s approach.
From statistical models to transformers
Legacy speech-to-text systems used a combination of acoustic, lexicon, and language models to transcribe speech. However, they were limited in their ability to accurately recognize different accents, dialects, and speech styles, and they required specialized expertise: experienced phoneticians had to collaborate to create a custom phonetic set for a language and repeat the process for each language the model would support.
The 1990s saw the introduction of statistical models like Hidden Markov Models (HMMs), allowing researchers to start developing more sophisticated language models that could help in recognizing patterns across different languages.
In the following decades, the rise of deep neural networks (DNN) and recurrent neural networks (RNNs) enabled better handling of the nuances of different languages, significantly improving accuracy.
In 2014, end-to-end systems began to emerge. Researchers Alex Graves from Google DeepMind and Navdeep Jaitly from the University of Toronto demonstrated end-to-end speech recognition with recurrent neural networks, paving the way for systems that fold language detection into ASR itself and transition between languages more seamlessly.
However, the groundbreaking development happened in 2017 with the rise of transformers and a self-attention mechanism that could detect language shifts in real-time conversations and better understand context.
Transformer-based multilingual ASR systems use language detection as their ears. Language detection leverages deep learning models trained on vast amounts of multilingual audio data, analyzes the incoming speech to identify subtle variations in speech patterns, and tailors the processing for that specific language.
You can leverage pre-trained models and fine-tune them for specific tasks such as translation or domain-specific transcription. Let's say you have a model that handles generic Dutch well, but you want it to cope with jargon-heavy medical terminology. To make the model understand the nuances of that domain, you'd fine-tune it on in-domain data, which for ASR typically means medical recordings paired with their transcripts.
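As a rough illustration, here is a minimal sketch of that setup using the Hugging Face transformers library and the openai/whisper-small checkpoint; the medical dataset itself is hypothetical and omitted, so this shows only the preparation steps, not a complete training run.

```python
# Minimal sketch: preparing a pretrained Whisper checkpoint for domain
# fine-tuning with Hugging Face transformers. The medical dataset is
# hypothetical and not included here.
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainingArguments,
)

checkpoint = "openai/whisper-small"  # assumption: a small multilingual checkpoint

# The processor bundles the feature extractor (audio -> log-mel spectrograms)
# and the tokenizer, here configured for Dutch transcription.
processor = WhisperProcessor.from_pretrained(
    checkpoint, language="dutch", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Freeze the acoustic encoder so fine-tuning mainly adapts the decoder
# to domain vocabulary (one common strategy, not the only one).
for param in model.model.encoder.parameters():
    param.requires_grad = False

# Typical training arguments; a (hypothetical) dataset of medical audio
# paired with transcripts would be passed to a Seq2SeqTrainer alongside these.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medical-nl",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=2000,
    predict_with_generate=True,
)
```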
While some models can identify the language on the fly, others – including some from Big Tech providers – require you to specify the input language upfront to transcribe it correctly. Moreover, few models, if any, have been equally exposed to all languages, so their word error rate (WER) – a metric used in ASR to assess accuracy – differs from language to language.
A notable example of a transformer-based ASR system that has reshaped the field of multilingual speech recognition is Whisper by OpenAI.
What makes Whisper multilingual?
Whisper, one of the best multilingual models on the market, was trained on 680,000 hours of audio data from the internet, of which 117,000 hours (roughly 17%) cover languages and speech domains other than English.
The model supports around 100 languages, including English, Spanish, French, German, and Chinese. Currently, the only translations supported are from non-English languages into English. Whisper can be fine-tuned to introduce new languages, dialects, and accents.
Here are some of the key features that contribute to Whisper's multilingual capabilities:
Transformer architecture. Transformers use self-attention mechanisms to evaluate the importance of different words in a sentence and capture contextual relationships.
Multitask learning. Whisper has been trained to perform multiple tasks simultaneously, such as transcribing speech into text, detecting and identifying the language spoken in audio, and translating speech into text in a different language.
Subword tokenization. This technique splits words into smaller subword units. Subwords that are common across several languages can be reused, improving Whisper's efficiency and its ability to recognize speech in different languages with less training (see the short tokenizer sketch after this list).
Cross-linguistic transfer learning. Since Whisper has been trained on a large and diverse dataset, it can transfer knowledge from one language to another, especially when there are similarities in phonology or syntax.
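To make the subword idea concrete, here is a small, illustrative snippet using the Hugging Face WhisperTokenizer; the exact splits depend on the checkpoint's vocabulary, so treat the output as indicative rather than definitive.

```python
# Illustrative only: inspect how Whisper's byte-pair-encoding tokenizer splits
# words from different languages into shared subword units.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

for word in ["internationalization",    # English
             "internationalisation",    # French spelling
             "Internationalisierung"]:  # German
    print(word, "->", tokenizer.tokenize(word))

# Related words often share subword pieces, which is what lets the model
# reuse what it learned in one language when processing another.
```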
In November 2023, OpenAI released Whisper Large-v3, an ambitious speech-to-text model to add to the Whisper family. Billed as a solution to the problem of low-resource languages, it promised extensive multilingual support.
The new version does appear to have addressed the problem to a certain extent, but not entirely, and it also introduced or amplified issues such as hallucinations and inconsistent accuracy.
While multilingual ASR has come a long way, there are still challenges to overcome.
Low-resource languages
ASR models need to learn more languages. The majority of audio used for training today is in English. The lack of extensive, high-quality datasets for low-resource and under-represented languages limits the ability of models to learn their nuances. Besides Whisper large-v3, notable efforts like Mozilla's Common Voice and Meta's "No Language Left Behind" project have also tried to address this gap.
Accents
A common hurdle is misinterpretation caused by strong accents. Traditional systems struggle when English or any other language is spoken with a heavy accent, often mistaking it for another.
The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel Frequency Cepstral Coefficients), a method inspired by human auditory perception. MFCCs belong to the field of psychoacoustics, which studies how we perceive sound. The method emphasizes lower frequencies and uses a Fourier transform with mel-scale filtering to convert audio into a compact frequency representation.
However, this approach has a limitation: it's based purely on acoustics. If you speak English with a strong accent, the system may not understand the content but instead identify the language based on your prosody: rhythm, stress, and intonation.
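For illustration, here is a minimal sketch of extracting MFCCs with the librosa library; the bundled example clip is just a stand-in for real speech audio.

```python
# A minimal sketch of extracting MFCCs with librosa, the kind of acoustic
# features a purely spectral language-detection front end would rely on.
# The bundled "trumpet" clip is a stand-in for real speech.
import librosa

audio_path = librosa.example("trumpet")   # any 16 kHz speech file works here
y, sr = librosa.load(audio_path, sr=16000)

# 13 coefficients per frame is a common choice; each frame summarizes the
# spectral envelope on the perceptually motivated mel scale.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```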
Code-switching
Another challenge is code-switching. Speakers may switch between languages mid-sentence, and the ability to adapt to these changes is crucial for effective communication. However, ASR systems are usually trained on data in just one language at a time.
Bridging the language gap
Despite these challenges, multilingual ASR with robust language detection and translation capabilities unlocks a range of use cases and benefits for businesses, including:
Call centers and sales-focused meeting platforms: Language detection and translation can generate notes in multiple languages, transcribe international meetings, and provide guidance for on-call agents and sales reps.
Multinational businesses: Real-time translation during meetings and conferences fosters smoother communication and collaboration among geographically dispersed teams.
Customer service: Providing support in a customer's preferred language enhances satisfaction and builds trust.
Education: Real-time translation during lectures or language practice sessions allows language learners to better grasp concepts while hearing them in their native language.
Healthcare: Removing the language barrier between patients and healthcare providers ensures more effective and clear communication.
How to choose a multilingual ASR provider
Here are some important factors to consider:
Word Error Rate (WER): The most commonly used metric for evaluating ASR models. It is the proportion (or percentage) of word-level errors relative to a human-transcribed ground-truth transcript. WER varies depending on the language: for example, the reported WER for Dutch with Whisper large-v3 is around 4.3%, while for Vietnamese it's 39.8% (a quick example of computing WER follows this list).
Multi-language support: Whether you handle multiple languages today or anticipate doing so in the future, opting for a service with extensive language support is wise. When selecting a provider, make sure to test it yourself instead of trusting the number of officially supported languages.
Speed: Many applications require fast turnaround time and excellent throughput. A responsive system will provide reduced latency and quick processing rates.
Cost: Solutions that do not provide appropriate ROI and a decent price-to-performance ratio will limit the overall use of the end-user application.
Customization, flexibility, and adaptability: One size doesn’t fit all. The ability to customize models for unique terminology or jargon and flexible deployment choices to suit project-specific privacy, security, and compliance requirements is essential.
Scalability and reliability: A solid solution can handle various audio data volumes, from small startups to large organizations. Similarly, frequent or prolonged service interruptions can hurt revenue and brand reputation.
Ease of adoption and use: Flexible pricing and packaging choices, including usage-based pricing and volume discounts, are a prerequisite.
Support and subject matter expertise: Domain expertise in AI, machine learning, and speech recognition is invaluable when problems emerge. Many providers outsource model creation or include it as an add-on to their primary service. Providers that specialize in voice AI can better diagnose and handle challenges, avoiding problems with stagnant performance over time.
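As a reference point for the WER figures above, here is a quick, illustrative computation using the jiwer package; the sentences are made up for the example.

```python
# Quick illustration of WER (referenced in the list above), using jiwer.
# WER = (substitutions + deletions + insertions) / number of reference words.
from jiwer import wer

reference  = "the patient was prescribed ten milligrams of amlodipine"
hypothesis = "the patient was prescribed ten milligrams of am lodi pine"

print(f"WER: {wer(reference, hypothesis):.2f}")
# Roughly 0.38 here: one substitution plus two insertions against an
# 8-word reference. Published per-language WERs are computed the same
# way over large benchmark datasets.
```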
The Gladia approach: Revolutionizing multilingual speech recognition
Our team at Gladia has been working on a solution that can identify language changes in real-time — in a matter of milliseconds — and adapt throughout the conversation to enable seamless multilingual interactions.
Here’s how we approach common multilingual ASR challenges.
Identifying under-represented languages
One of the biggest challenges in ASR transcription is supporting and detecting under-represented languages.
Let’s take Ukrainian and Russian as an example. The two languages share similar prosody and vocabulary, making it difficult to accurately detect the language.
In addition, the Ukrainian language, with its unique characteristics, is often underrepresented in training datasets, leading to potential biases in language recognition — Ukrainian speakers can be misidentified as Russian.
To address this challenge, we've developed a mix of three systems:
Candidate language detection based on acoustic signatures
Intrinsic conversation analysis
Balancing algorithm to account for the imbalance in the dataset
This approach allowed us to incorporate linguistic nuances into the language detection system.
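As a toy illustration of the third ingredient, and not a description of Gladia's actual implementation, the sketch below applies a simple Bayes-style prior correction to acoustic language-ID scores; all probabilities are made-up numbers.

```python
# Toy illustration (not Gladia's actual system): correcting acoustic
# language-ID probabilities for dataset imbalance with a Bayes-style
# prior adjustment. The priors below are invented for the example.
import numpy as np

def rebalance(acoustic_probs, train_priors, target_priors):
    """Divide out the skewed training prior, apply the desired prior, renormalize."""
    langs = list(acoustic_probs)
    scores = np.array([
        acoustic_probs[l] / train_priors[l] * target_priors[l] for l in langs
    ])
    scores /= scores.sum()
    return dict(zip(langs, scores))

# The acoustic model leans towards Russian because it saw far more Russian in training.
acoustic = {"ru": 0.55, "uk": 0.45}
train_priors = {"ru": 0.8, "uk": 0.2}    # imbalanced training data
target_priors = {"ru": 0.5, "uk": 0.5}   # treat both languages as equally likely

print(rebalance(acoustic, train_priors, target_priors))
# -> Ukrainian comes out on top once the training imbalance is accounted for.
```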
Hybrid system for accent management
One of the significant advantages of our solution is its ability to handle accents.
We've developed a unique hybrid approach that combines machine learning and rule-based systems to manage accents.
If we relied solely on vocal spectrum analysis, speakers who don’t have a native English accent might be misidentified. That’s why our solution doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks.
The result? We've solved 99% of edge cases involving strong accents in language detection. A huge win for inclusivity and accuracy in speech recognition technology.
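To make the dual acoustic-plus-content idea concrete, here is a toy sketch, not our production pipeline: it assumes acoustic language probabilities come from an upstream model and uses the langdetect package as a stand-in for text-based language identification.

```python
# Toy sketch of "listen to how you speak AND understand what you say".
# Acoustic probabilities are assumed to come from an upstream model;
# text-based detection here uses the langdetect package as a stand-in.
from langdetect import detect_langs

def fuse_language_scores(acoustic_probs, draft_transcript, text_weight=0.6):
    """Blend acoustic language-ID scores with text-based scores over a draft transcript."""
    text_probs = {item.lang: item.prob for item in detect_langs(draft_transcript)}
    langs = set(acoustic_probs) | set(text_probs)
    fused = {
        lang: (1 - text_weight) * acoustic_probs.get(lang, 0.0)
              + text_weight * text_probs.get(lang, 0.0)
        for lang in langs
    }
    total = sum(fused.values())
    return {lang: p / total for lang, p in fused.items()}

# A heavy French accent pushes the acoustic score towards "fr",
# but the words themselves are clearly English.
acoustic = {"fr": 0.7, "en": 0.3}
draft = "could you please send me the quarterly report before the meeting"
print(fuse_language_scores(acoustic, draft))  # English wins after fusion
```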
A novel approach to code-switching
Each speaker has their own unique vocal signature, which we call features. By analyzing the vocal spectrum, machine learning algorithms can perform classifications. We use the Mel Frequency Cepstral Coefficients (MFCC) to extract the main frequency characteristics.
However, in the case of non-native speakers, the system may struggle to detect the language accurately. For instance, a French speaker with a strong accent could be misidentified as speaking French even when speaking English.
To address this, we've introduced code-switching support, which lets our system focus on specific language pairs and limit code-switching detection to user-defined languages. This approach reduces the exploration space and enables more accurate language detection in real time.
Code-switching is especially important in multilingual conversations, where speakers may switch between languages mid-sentence and the ability to adapt to these changes is essential.
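Here is an illustrative sketch of that idea using the open-source openai-whisper package as a stand-in for a production system; the audio path is hypothetical, and the point is simply that restricting detection to an allowed set shrinks the search space.

```python
# Sketch of limiting language detection to a user-defined pair, using the
# open-source openai-whisper package as a stand-in (not Gladia's internals).
# "meeting.wav" is a hypothetical file path.
import whisper

model = whisper.load_model("base")

audio = whisper.pad_or_trim(whisper.load_audio("meeting.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Full probability distribution over all languages Whisper knows.
_, probs = model.detect_language(mel)

# Restrict the decision to the languages the user declared for this call,
# which shrinks the exploration space and avoids spurious detections.
allowed = {"en", "fr"}
restricted = {lang: p for lang, p in probs.items() if lang in allowed}
language = max(restricted, key=restricted.get)

result = whisper.decode(model, mel, whisper.DecodingOptions(language=language))
print(language, result.text)
```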
The future of multilingual speech recognition
As we continue to refine our technology, we're committed to making multilingual speech recognition more accurate, efficient, and accessible.
Gladia's API leverages advanced ASR technology to automatically detect the language, recognize speakers, and accurately transcribe the audio even if speakers switch languages throughout the conversation. The API currently supports 100+ languages for transcription, diarization, and translation.
About Gladia
Gladia provides a speech-to-text and audio intelligence API for building virtual meetings, note-taking apps, call center platforms and media products, providing transcription, translation and insights powered by best-in-class ASR, LLMs, and GenAI models.