An introduction to ASR speaker recognition: identification, verification and diarization
Published on
Mar 2024

Due to individual differences in physical attributes like vocal tract shape, every person has a distinct voice pattern. In automatic speech recognition (ASR), this uniqueness is harnessed to identify speakers by extracting and analyzing voice features such as pitch and frequency.

So, how exactly can AI-powered technologies be utilized for this purpose? This is where speaker recognition comes into play.

If you're building an audio or video-based product, this is a must-read to understand the mechanics of ASR speaker recognition, learn about the differences between identification, verification and speaker diarization, and get an overview of the key factors to consider when building ASR speaker detection systems.

What is speaker recognition?

Speaker recognition (SR) is a long-established and critical field of AI, focused on solving the problems of speaker identification, verification and diarization.

Image with ASR speaker recognition definitions: identification, verification, diarization

Before diving into how each of these works, the different use cases they unlock, and how to choose the most suitable multi-speaker system for your needs, let's start by looking at the most fundamental element in the field of speech recognition – sound.

What is sound?

Sound is a vibration that travels through the air as a wave until it reaches the listener's ear, where it is perceived and interpreted.

In the air, sound is a continuous signal that can be represented in the time domain, which shows how the signal's amplitude changes as a function of time.

In digital audio, the sound is converted into a discrete signal, which can also be represented in the frequency domain, showing how the signal's amplitude is distributed across different frequencies.

ASR speaker recognition: continuous signal graph
ASR speaker recognition: discrete signal graph

For AI-powered speech recognition systems, every person’s voice represents a unique sound to be deciphered. The key properties of the multi-speaker input audio that are taken into account for speaker recognition include wavelength, frequency, pitch, amplitude and sample rate. 

Wavelength

Wavelength refers to the physical length of a sound wave and is measured from one peak of the wave to the following one. Analyzing the wavelength serves to identify unique patterns in speech, as different individuals produce speech with variations in wavelength due to their vocal tract differences.

Frequency

Frequency is the rate at which a sound wave vibrates. It's calculated by the number of complete wave cycles in a second, and is measured in Hertz (Hz). Frequency patterns are used to differentiate between speakers, based on individual differences in vocal ranges and speech patterns.

ASR speaker recognition: frequency graph

Pitch

Pitch is the attribute of sound that determines how high or low a tone is perceived, for instance whether it sounds strident or flat. It is primarily determined by the wave's frequency.

ASR speaker recognition: pitch graph
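To make this concrete, here is a minimal sketch of how a pitch contour can be estimated from a recording, assuming the librosa library and an illustrative file name:

```python
import numpy as np
import librosa

# Load a short speech clip and estimate its fundamental frequency (perceived pitch).
audio, sr = librosa.load("speech_clip.wav", sr=16_000)  # illustrative file name
f0, voiced_flag, voiced_prob = librosa.pyin(audio, fmin=65, fmax=300, sr=sr)

# Averaging F0 over voiced frames gives a rough sense of how high or low the speaker sounds.
print(f"Average pitch: {np.nanmean(f0):.1f} Hz")
```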

Amplitude

Amplitude refers to the magnitude or strength of a sound wave. It corresponds to the height of the wave and indicates how loud the sound is. While not used by default for speaker recognition, it can be relevant for normalizing speech signals to ensure consistent analysis across different recordings.

Amplitude as part of the sound wave graph in ASR speaker recognition

Sample rate

Measured in Hz, sample rate refers to the number of audio samples captured per second. Each sample is a measurement of a small slice of the signal; together, the samples form an array that serves as a digital representation of the signal.

ASR speaker recognition: sample rate graph
The line represents a sound wave, while the dots are samples.

Sample rate plays a key role in ASR speaker recognition, helping to capture detailed information about the speaker’s particular pronunciation, intonation, and other speaker-specific features.

For best results, a higher sample rate is preferable, as it captures more nuance: the distance between samples is smaller, so the digital signal follows the original wave more closely. Bear in mind that, while favourable for speaker recognition, a higher sample rate also requires additional resources.

There is a minimum sample rate below which audio becomes unintelligible for ASR models. Most audio produced today is recorded at 44.1 kHz or higher, but some types of audio, such as phone recordings from call centers, use lower sample rates of 16 kHz or even 8 kHz. Audio at higher sample rates generally needs to be resampled to the rate the ASR model expects.
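If the incoming audio does not match the rate your ASR model expects, it can be resampled first. Below is a minimal sketch using librosa; the 16 kHz target and the file name are assumptions to adapt to your model and data:

```python
import librosa

TARGET_SR = 16_000  # assumed target rate; check what your ASR model expects

# Load the recording at its native sample rate (sr=None keeps the original rate).
audio, native_sr = librosa.load("call_recording.wav", sr=None)

# Resample only if the native rate differs from the target.
if native_sr != TARGET_SR:
    audio = librosa.resample(audio, orig_sr=native_sr, target_sr=TARGET_SR)
```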

How ASR speaker recognition works

The elements of sound theory, including wavelength, frequency, pitch, and sample rate, are essential prerequisites for speaker recognition models to extract unique characteristics of voice.

By analyzing patterns in wavelength, frequency distribution and pitch variation, and even considering aspects like amplitude and sample rate for signal quality, these systems can distinguish between speakers and improve accuracy in speaker recognition tasks.

Two main approaches are used in the speaker recognition field, particularly for identification, to carry out this process.

Audio fingerprinting

Audio fingerprinting is a classical method of speaker recognition based on extracting a distinctive identifier, or fingerprint, from the audio signal. This technique uses the Fourier transform to generate a spectrogram from the audio file, which is then compared against an existing database. Specifically, it zeroes in on unique features of the speaker's voice, which is particularly useful in ASR speaker detection tasks.

ASR speaker recognition methods: audio fingerprinting
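To illustrate the first step, here is a minimal sketch that computes a spectrogram with SciPy and derives a deliberately simplified fingerprint from it; real systems use more robust peak-pair hashing, and the file name is illustrative:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Read the audio and compute its spectrogram via a short-time Fourier transform.
sample_rate, samples = wavfile.read("speaker_sample.wav")
if samples.ndim > 1:
    samples = samples[:, 0]  # keep one channel if the file is stereo
frequencies, times, spec = spectrogram(samples, fs=sample_rate, nperseg=512)

# A crude fingerprint: the index of the strongest frequency bin in each time frame.
# Matching then amounts to comparing this sequence against fingerprints in the database.
fingerprint = np.argmax(spec, axis=0)
```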

Machine learning

One of the most well-known alternative techniques to audio fingerprinting involves training a machine learning (ML) model on a diverse dataset encompassing various speakers and speech lengths.

This trained model can then be employed to determine or verify speaker identity in unfamiliar audio files that were not part of its training data, returning the most likely match based on a scoring technique. To delve deeper into ASR models and how they function, explore our introduction to speech-to-text AI.

ASR speaker recognition methods: machine learning (ML)
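As an illustration, here is a minimal sketch of the embedding-and-scoring idea using the open-source Resemblyzer speaker encoder (one of several possible choices); the file names are assumptions:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Load a pretrained speaker encoder and turn two recordings into fixed-size embeddings.
encoder = VoiceEncoder()
embedding_a = encoder.embed_utterance(preprocess_wav("known_speaker.wav"))
embedding_b = encoder.embed_utterance(preprocess_wav("unknown_speaker.wav"))

# Score the pair with cosine similarity: values close to 1.0 suggest the same speaker.
score = np.dot(embedding_a, embedding_b) / (
    np.linalg.norm(embedding_a) * np.linalg.norm(embedding_b)
)
print(f"Similarity score: {score:.3f}")
```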

Note that the two approaches are not mutually exclusive: they can be used side by side or combined. For example, audio fingerprinting can handle speaker identification while the ML approach handles speaker verification; alternatively, fingerprinting can be used for robustness and speed in identification, with ML used to improve accuracy over time.

Audio fingerprinting vs ML approach

  • Accuracy: ML gives better results, since the model is specifically trained on diverse datasets to accurately capture a variety of voices.
  • Speed: Audio fingerprinting is faster, since it computes the spectrogram of the audio and then searches for an exact match in the database.
  • Unseen data: ML gives better results, since the model can handle previously unknown data, while fingerprinting is limited to searches within the database.
  • Scalability: Audio fingerprinting is easier to scale by adding audio files to the database, without requiring model retraining as ML does.
  • Computational cost: ML has a higher cost, since it requires more computational power to run inference.

ASR speaker recognition: key terms and use cases

Speaker identification

Speaker identification is the process of determining which of a set of known speakers is speaking. Unlike ASR, which focuses on what is said, this process is usually text-independent: the focus is not on what is being said, but on how it is being said. It can be done in the two ways described above: either by generating a spectrogram from the audio and comparing it with those in the database, or by leveraging ML to extract features from the audio and using trained models to predict the speaker with the most similar voice.

Example of use cases: Access control, forensic investigations
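A minimal sketch of the identification step, assuming speaker embeddings have already been computed (for example with an encoder like the one shown earlier); the random vectors stand in for real enrollment data:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(unknown_embedding, enrolled):
    """Return the enrolled speaker whose reference embedding is most similar."""
    scores = {name: cosine(unknown_embedding, ref) for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# 'enrolled' maps known speaker names to embeddings computed at enrollment time
# (hypothetical random data is used here purely for illustration).
enrolled = {"alice": np.random.rand(256), "bob": np.random.rand(256)}
unknown = np.random.rand(256)
print(identify(unknown, enrolled))
```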

Speaker verification 

Speaker verification is the process of confirming that a speaker is who they claim to be. Verification can be text-dependent or text-independent. It is done by extracting speaker embeddings from the audio, comparing them with a reference embedding, and making a decision based on a similarity score.

Example of use cases: Security and fraud
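Verification can be sketched as a single threshold decision on the same kind of similarity score; the threshold below is purely illustrative and would need tuning on held-out data:

```python
import numpy as np

THRESHOLD = 0.75  # illustrative value; tune it on a labeled evaluation set

def verify(reference_embedding, test_embedding, threshold=THRESHOLD):
    """Accept the claimed identity if the embeddings are similar enough."""
    score = float(np.dot(reference_embedding, test_embedding) /
                  (np.linalg.norm(reference_embedding) * np.linalg.norm(test_embedding)))
    return score >= threshold, score
```

Raising the threshold reduces false acceptances at the cost of more false rejections, which is the key trade-off in security-oriented deployments.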

Speaker diarization 

In audio transcription, speaker diarization is the process of automatically partitioning an audio recording into segments that correspond to different speakers. This is done by using a combination of techniques to distinguish and cluster segments of the audio signal based on speaker embeddings, producing a transcript attributed to multiple speakers.

For a deeper dive into the inner workings of diarization, check out our blog.

Example of use cases: Conferences, media monitoring, security identification.
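For intuition only, here is a highly simplified diarization sketch: the recording is assumed to have already been cut into fixed-length windows with one speaker embedding per window, and the windows are then clustered with scikit-learn. Production pipelines add voice activity detection, overlap handling and smarter segmentation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(window_embeddings, window_times, n_speakers=2):
    """Cluster per-window embeddings and return (start, end, speaker) segments."""
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(window_embeddings)
    return [(start, end, f"speaker_{label}")
            for (start, end), label in zip(window_times, labels)]

# Hypothetical input: one embedding per 1.5-second window of the recording.
embeddings = np.random.rand(8, 256)
times = [(i * 1.5, (i + 1) * 1.5) for i in range(8)]
for start, end, speaker in diarize(embeddings, times):
    print(f"{start:4.1f}s - {end:4.1f}s  {speaker}")
```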

Building multilingual speaker recognition into your product: key factors to consider

While ASR systems are becoming more robust and better at recognizing speakers, the modern business environment presents an additional challenge – the fact that communication occurs in multiple languages.

Podcasts, meetings and live events can feature many speakers using different languages and dialects. The same speaker may even switch from one language to another within a single recording, but is that a task all ASR systems are fit to solve?

Thanks to the latest advances in ASR, speaker recognition mechanisms have evolved significantly from acoustic-based and/or fingerprinting recognition of speakers to a sophisticated ML-driven approach, based on embeddings that contain nuanced speaker information, traceable across recordings. 

That said, certain factors need to be taken into account when integrating speaker recognition capabilities into an ASR product, whether through an open-source platform or an API.

1- Acoustic challenges: Different languages possess distinct acoustic properties crucial for speaker identification. What works well for one language might not be as effective for another.

2- Linguistic challenges: Speakers exhibit diverse speech patterns based on their language. Some may speak more rapidly than others due to personal specificities as well as inherent language traits: for example, Japanese has been found to be among the fastest spoken languages in the world and Mandarin among the slowest, based on the number of syllables spoken per second. This presents an obvious challenge for accurate recognition, especially in audio where the same speaker intervenes in multiple languages.

3- Audio quality: Whatever the language, high-quality input audio gives better results than audio recorded in noisy environments like call centers or emergency services. To mitigate this, special model optimization techniques, akin to the one deployed by Gladia to optimize Whisper ASR, are needed.

To properly address these challenges, the following features are a must-have to achieve the best performance:

  • Automatic language detection - the process of automatically identifying the language being spoken, integral to any efficient ASR system.
  • Accuracy - transcribing the speech of multiple speakers in multiple languages with minimal errors and hallucinations is key for text-dependent identification and speaker diarization.
  • Speed - becomes especially important if you want to derive speaker-based insights in real time from a customer call, a live streaming event or a meeting.
  • Code-switching - enables the ASR system to accurately transcribe and diarize audio even when speakers switch between languages.

By integrating robust multi-speaker transcription and advanced language support into your product, you can unleash a wealth of features that empower users to maximize their experience during meetings, conversations, and live events.

At Gladia, we provide a plug-and-play audio intelligence API, including state-of-the-art diarization, live transcription, translation and code-switching among other features, enabling virtual meeting, media and call center companies to harness the benefits of AI for voice. Feel free to learn more about our product and pricing, or reach out to us directly for a demo.

Conclusion

As we discussed in this article, speaker recognition encompasses a powerful set of AI technologies used for a variety of crucial identity-based tasks. With speaker recognition and ASR combined, companies can benefit from additional insights into user behaviour and leverage this data for better analysis, improved customer support and more informed decision-making.


About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.

