Following the release of ChatGPT, prompt engineering for LLMs became one of the most widely discussed fields in AI. Prompt injection in Speech Recognition in particular, used to guide the underlying model to produce more accurate results, is a fascinating NLP technique worth exploring in more detail.
Relying on Gladia’s expertise in Audio Intelligence AI, in this blog post I will cover how prompt injection in Speech Recognition enhances Automatic Speech Recognition (ASR) by improving on earlier speech adaptation and keyword-boosting methods, which are heavier yet complementary.
Setting the scene
Like many AI systems of the last few years (and like more classical techniques many years before), OpenAI’s Whisper is built on an encoder / decoder architecture. This design has a major advantage: it turns audio into an abstract representation that can be manipulated mathematically.
The basic mechanism here can be illustrated with a triangle connecting audio, latent space, and the final word transcript.
Let’s zoom in a little. Here comes the general view of audio processing in Automatic Speech Recognition (ASR).
Before going into the details of how all of these components come together, the first thing to understand is:
The main objective of the prompt injection approach is to provide some guidance to the decoder in order to capture the right words and perform better — thanks to context-setting.
Now, let’s dive into the key elements of ASR and explore why prompt injection is a miraculous addition to the mix that can drastically improve the quality of audio transcription.
What are Features in Speech Recognition?
Performing Speech-to-Text is like solving a big Lego puzzle: the speech needs to be broken into smaller elementary bricks called “features” or “tokens” in order to be processed.
These elementary features are able to “embed” multiple aspects of the original speech segment, such as the speech tone, pitch, volume, or even speech rate.
In speech recognition, these features are extracted in a “pre-processing” step, usually by computing a spectrogram or Mel-Frequency Cepstral Coefficients (MFCCs). In effect, the audio becomes a simple “picture” of itself: a map of how much energy each frequency carries over time.
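To make the “picture” idea concrete, here is a minimal sketch of a spectrogram computed with the Python standard library only. The sample rate, tone frequency, frame size, and hop are assumed values chosen for illustration, and the naive DFT is far slower than what production systems use; it is not how Whisper or Gladia extract features.

```python
import cmath
import math

def hann(n_points):
    """Hann window, used to soften the edges of each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n_points - 1)) for i in range(n_points)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and take the spectrum of each one."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        frames.append(magnitude_spectrum(frame))
    return frames

# Synthetic "audio": a 1 kHz tone sampled at 8 kHz (assumed values).
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(samples, frame_size=FRAME_SIZE, hop=32)

# Each row of `spec` is one time slice; each column is one frequency bin.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(f"{len(spec)} frames x {len(spec[0])} bins, strongest frequency: {peak_hz:.0f} Hz")
```

Rows are time slices and columns are frequency bins, which is exactly the grid that gets drawn as an image. MFCCs go a step further by compressing each row onto a perceptual (Mel) frequency scale.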
What are Encoders and Latent Spaces?
Continuing with the Lego analogy, it’s now time to put some magic magnets on the back of each brick — so that the bricks that were originally together can form a magic link.
During the training phase, the strength of the magnetic links can vary between the different bricks to finally form a fixed picture of “link strengths” between them.
This final picture, the Latent Space, contains Lego bricks / tokens (illustrated as colored dots below) with unique identifiers.
Another way to see it is that during the encoding phase, some elementary parts of speech (i.e. tokens) can be more or less attracted toward other tokens. The vector representations of these interactions (or “magic links”) are known as “embeddings”.
Disclaimer: This is a very inaccurate explanation from a scientific perspective, but it helps to emphasize that not all tokens play an equal role in the Latent Space, so some are more likely than others to be “activated” or “picked” (and some not at all, if they are very shy ;-) ).
When we explained this to one of our customers, it came as a revelation to him: “It’s like Esperanto for computers!”. And it’s exactly that.
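The “attraction” between tokens can be sketched with cosine similarity between embedding vectors. The three-dimensional vectors and the words below are made up for illustration; real models learn embeddings with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'strongly attracted'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "modem": [0.9, 0.1, 0.2],
    "router": [0.8, 0.2, 0.1],
    "apple": [0.1, 0.9, 0.3],
}

modem_router = cosine_similarity(embeddings["modem"], embeddings["router"])
modem_apple = cosine_similarity(embeddings["modem"], embeddings["apple"])

print(f"modem ~ router: {modem_router:.2f}")
print(f"modem ~ apple:  {modem_apple:.2f}")
```

In this toy space, “modem” and “router” point in almost the same direction, while “apple” sits elsewhere: a strong magnetic link in the first case, a weak one in the second.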
What is a Decoder?
A “decoder” is a Lego brick builder that combines the pieces based on their interactions (embeddings) and translates the beautiful Lego tapestry into a structure humans can understand: written language. It takes the unique token identifiers and converts them into their equivalent words (or pieces of words).
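As a rough sketch of that last step, the toy decoder below picks the best-scoring token identifier at each position and joins the matching word pieces. The vocabulary and the scores are hypothetical, and a real decoder computes each step's scores from the audio and from the tokens it has already produced, which this simplified version skips.

```python
# Hypothetical vocabulary: token identifier -> word piece.
VOCAB = {0: " fib", 1: "er", 2: " optic", 3: " cid", 4: "<end>"}

def greedy_decode(step_scores):
    """Pick the best-scoring token at every step, stop at <end>, then join the pieces."""
    pieces = []
    for scores in step_scores:
        token_id = max(range(len(scores)), key=lambda i: scores[i])
        if VOCAB[token_id] == "<end>":
            break
        pieces.append(VOCAB[token_id])
    return "".join(pieces).strip()

# One row of made-up scores per decoding step, one column per token identifier.
step_scores = [
    [2.1, 0.0, 0.3, 1.9, 0.0],  # " fib" narrowly beats " cid"
    [0.1, 3.0, 0.2, 0.1, 0.0],  # "er"
    [0.0, 0.1, 2.5, 0.0, 0.4],  # " optic"
    [0.0, 0.0, 0.1, 0.0, 3.2],  # "<end>"
]

text = greedy_decode(step_scores)
print(text)
```

Note how close the first step is: a small nudge to the scores would be enough to change which piece wins.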
Prompt Injection in Speech Recognition
So finally, what’s “prompt injection”?
Prompt injection is a technique that will change the interactions between the bricks by dropping a giant magnet in the middle of the Magic Magnetic Legos.
It will help the decoder handpick pieces closer to the giant magnet, and pieces that might not have been originally selected (because they were too shy) can now emerge by virtue of being found closer to the giant magnet.
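For readers who want to try this, the open-source `openai-whisper` Python package accepts an `initial_prompt` argument in `transcribe`, which feeds text to the decoder as prior context. The sketch below assumes that package is installed; the helper name, the default model size, and the file name in the usage comment are placeholders, and hosted APIs (including Gladia’s) may expose prompting differently.

```python
def transcribe_with_prompt(audio_path, prompt, model_name="base"):
    """Transcribe a file with open-source Whisper, seeding the decoder with a context prompt."""
    # Imported lazily so this sketch can be loaded even where the package is not installed
    # (pip install openai-whisper).
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, initial_prompt=prompt)
    return result["text"]

# Example call (hypothetical file name):
# transcribe_with_prompt("call.wav", "This conversation is about telecoms and internet equipment.")
```

The prompt does not change the audio features at all. It only changes the context the decoder sees before it starts picking tokens.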
A concrete example: Fiber vs. Cider
Let’s imagine we’re working with a poor-quality audio file where it’s hard to hear if the speaker talks about “Fiber” or “Cider”.
Fiber has strong interactions and is well connected in the Latent Space, as it was seen very frequently in the training dataset and is linked to many concepts, while Cider is not.
By guiding the transcription with a prompt such as “this conversation is about telecoms and internet equipment”, we’re dropping the red magnet far from Cider and close to Fiber, which the model will then pick as the best candidate.
In this scenario, Cider and Fiber remain very close regarding feature representation (they look alike in tone, energy, pitch, volume, …). In other words, the color and shape of the Lego brick Cider are very close to that of Fiber.
But, as the prompt is introduced, the magnetic sticker on the Fiber brick becomes way stronger and more connected to the rest of the Legos in the game (Latent Space).
Now, let's drop another prompt injection: “this conversation is about alcohol and drinks”. This time the red magnet is really close to Cider, and even though the magnetic sticker on the back of the Cider brick is not originally as strong as the one on Fiber, Cider is more likely to be picked in the end.
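The Fiber vs. Cider scenario can be sketched as a simple score combination. All numbers below are invented for illustration: the two words get nearly identical acoustic scores, Fiber gets a head start because it is better connected, and each prompt adds its own shift. A real model does not add fixed offsets like this; the prompt changes the decoder’s context and the scores move as a result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {word: math.exp(score - top) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

# Hypothetical scores, chosen only to mirror the example in the text.
ACOUSTIC = {"fiber": 0.05, "cider": 0.0}   # the two words sound almost the same
BASE_PRIOR = {"fiber": 1.0, "cider": 0.0}  # Fiber is better connected in the Latent Space
PROMPT_SHIFT = {
    "telecoms": {"fiber": 2.0, "cider": -1.0},  # giant magnet dropped near Fiber
    "drinks": {"fiber": -1.0, "cider": 3.0},    # giant magnet dropped near Cider
}

def pick_word(prompt_topic=None):
    """Combine acoustic score, base prior, and prompt shift, then return the winner."""
    shift = PROMPT_SHIFT.get(prompt_topic, {})
    scores = {word: ACOUSTIC[word] + BASE_PRIOR[word] + shift.get(word, 0.0) for word in ACOUSTIC}
    probs = softmax(scores)
    return max(probs, key=probs.get), probs

for topic in (None, "telecoms", "drinks"):
    winner, probs = pick_word(topic)
    print(f"prompt={topic}: {winner} (p={probs[winner]:.2f})")
```

Without a prompt, Fiber wins on its head start alone. The “drinks” prompt is what lets the shy Cider brick overtake it, even though the audio evidence is unchanged.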
Other more conventional and complementary techniques
Many other techniques can help to make transcription more accurate. For example, keyword boosting can be thought of as virtually increasing the Lego brick's size, making its magnetic sticker bigger and more attractive, and so more likely to be picked.
Speech adaptation is another technique: it consists of adapting the encoder so that it understands new “waveform shapes” and maps them correctly into the Latent Space.
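One simple way to picture keyword boosting is a fixed bonus added to the scores of a chosen word list. The function name, the bonus value, and the candidate scores below are hypothetical; actual ASR engines implement boosting in different ways.

```python
def boost_keywords(scores, keywords, bonus=2.0):
    """Return a copy of the candidate scores with a fixed bonus added to each listed keyword."""
    return {word: score + (bonus if word in keywords else 0.0) for word, score in scores.items()}

scores = {"fiber": 1.05, "cider": 0.0}  # hypothetical decoder scores before boosting
boosted = boost_keywords(scores, keywords={"cider"})
best = max(boosted, key=boosted.get)
print(best)
```

Unlike a prompt, which shifts the whole context, this targets individual words, which is why the two approaches complement each other.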
Here is a more comprehensive map of corrective measures to improve speech recognition quality.
In conclusion, we have seen that prompt injection in speech recognition is a fascinating new tool, albeit one with malicious potential, that complements existing techniques and will help ASR technology reach higher quality standards.