Transcribing long audio files with Whisper using Python and the Gladia API
Published on Dec 8, 2023
The Whisper ASR model released by OpenAI is great for transcribing audio files, but it doesn’t come without challenges. In addition to high computational requirements and costs, Whisper limits input audio to 25 MB in file size and 30 seconds in duration, so larger audio files usually have to be split into chunks before they can be transcribed.
This method is not only impractical and time-consuming but also reduces the quality of the resulting transcription, which is a serious inconvenience for enterprise-grade projects. In this article, we explore the Gladia speech-to-text API, powered by an optimized, hallucination-free version of Whisper, as a production-grade alternative to the original model.
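To make the chunking workaround concrete, here is a minimal sketch of how a long recording would be cut into fixed-length windows. The 30-second window comes from the limit described above; the function name chunk_spans is hypothetical, and a real pipeline would also have to decode the audio and deal with words that straddle chunk boundaries.

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    spans = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shortened so it never runs past the end of the audio
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


# A one-hour recording (3600 seconds) needs 120 separate 30-second chunks
print(len(chunk_spans(3600)))
```

Each of those chunks would need its own transcription call, and the partial transcripts would then have to be stitched back together.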
Whisper ASR limitations: Long audio
Released in open access in 2022, Whisper ASR was a truly remarkable achievement in the field of automatic speech recognition, which set a new standard for accuracy and multilingual capabilities. While it remains perfectly suitable for indie projects and academic research, the open-source model comes with a number of limitations that make it challenging to use at scale for ever-growing enterprise needs and applications.
Take Whisper’s input requirements. Going through OpenAI’s FAQ, we see the community raising issues with the audio upload size limit. One user complained about receiving a size-limit-exceeded error despite uploading an audio file smaller than 25 MB, while another stated that “Currently, the Whisper model only supports video files that are up to 30 seconds long [..].” With the Whisper API, there’s also a limit on the number of requests per minute.
How does Gladia address Whisper’s shortcomings?
Gladia provides an optimized version of OpenAI's Whisper that addresses the key limitations of the original model. Gladia’s hybrid architecture uses an ensemble of machine learning models to optimize each step of the transcription process, which helps to eliminate Whisper’s hallucinations and results in a more accurate and reliable transcription service. Additionally, Gladia offers several useful features not available with the original model, such as real-time transcription, speaker diarization, and code-switching.
We have, of course, also addressed the Whisper model’s file size limitation. With our API, enterprise users can upload audio files up to 500 MB in size and up to 135 minutes long, extendable on request. This eliminates the need to manually pre-process input files, so a company can transcribe multiple large audio or video files in any format without extra work.
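As an illustration, a client could check a file against these limits before uploading it. This is only a sketch: the helper name within_limits is hypothetical, the limits are the figures quoted above, and interpreting 500 MB as 500 × 1024 × 1024 bytes is an assumption.

```python
MAX_BYTES = 500 * 1024 * 1024  # 500 MB (binary megabytes are an assumption)
MAX_SECONDS = 135 * 60         # 135 minutes


def within_limits(size_bytes: int, duration_s: float) -> bool:
    """Return True when a file fits the size and duration limits quoted above."""
    return size_bytes <= MAX_BYTES and duration_s <= MAX_SECONDS


# A 60 MB, one-hour recording fits comfortably within both limits
print(within_limits(60 * 1024 * 1024, 60 * 60))
```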
Unlike Whisper, our API can also accept audio URLs and supports callbacks. We provide webhooks and support SRT and VTT output formats optimized for media captions, too. In short, you don’t have to worry about formats, sizes, and other input parameters: we take care of everything.
Overview and prerequisites
This tutorial is intended for developers who want to transcribe audio or video files of any size using Gladia's API. To follow along with this tutorial, you must have:
1. A strong understanding of the Python programming language
Please note that while this tutorial keeps the API handling simple, in a production environment you should follow best practices and store your API key as an environment variable. This helps to keep your API key secure and prevents unauthorized access. Gladia also supports JavaScript and PHP.
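A minimal sketch of reading the key from the environment is shown below. The variable name GLADIA_API_KEY and the helper load_gladia_key are assumptions made for illustration, not names required by Gladia.

```python
import os


def load_gladia_key(env_var: str = "GLADIA_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable before running the script")
    return key
```

With this in place, the key can be assigned with gladia_key = load_gladia_key() instead of being hard-coded in the source file.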
The full code used in this tutorial is located in this GitHub repository.
Setting up Gladia API
Features of Gladia API
The features Gladia API provides are as follows:
1. Real-time transcription: This feature utilizes webhooks to receive audio streams in real-time and then automatically returns an audio transcription. This helps businesses to easily take notes of what is being said during meetings and beyond.
2. Speaker diarization: Audio recordings can feature one or more speakers, and this necessitates identifying and separating the speakers during transcription. Gladia’s API achieves speaker diarization easily through our proprietary diarization mechanism, which delivers state-of-the-art performance.
3. Word-level timestamps: Our API also gives each word in the resulting transcript an accurate timestamp, which can prove useful when editing videos and adding subtitles.
4. Translation: With our API, you can easily receive your transcripts in any language you desire by simply setting a desired output language. Gladia API supports translation from any-to-any of the 99 supported languages with exceptional accuracy and lower word error rates in most of them.
5. Code-switching: Our API can handle difficult situations where speakers in an audio recording switch between two or more languages while conversing, and still provide accurate transcripts.
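To show why word-level timestamps are useful for subtitles, the sketch below groups timed words into SRT caption blocks. The input shape (a list of dictionaries with word, start, and end keys, times in seconds) is an assumption made for illustration and may not match the exact fields in Gladia's response; the helper names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT blocks of at most max_words words."""
    blocks = []
    for index, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        start = srt_timestamp(group[0]["start"])   # first word opens the caption
        end = srt_timestamp(group[-1]["end"])      # last word closes it
        text = " ".join(w["word"] for w in group)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical word-level output, used only to exercise the helpers
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 1.0},
]
print(words_to_srt(sample))
```

In practice, requesting the SRT or VTT output format from the API directly avoids this step; the sketch only illustrates what the timestamps make possible.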
Registration and obtaining API credentials
The first step in this tutorial is to get your own Gladia API key, and you can do this by following the steps below.
3. Select the API Keys header, add a short description, and generate an API key
Transcription using Gladia
In this tutorial, we will guide you through the process of transcribing an audio file with two speakers, lasting one hour and having a file size of 60 MB.
To begin, create a Python file. The name of the file can be anything you want, but in this tutorial, we will name it "main.py." Import the packages that we will be using, which are the os package for interacting with the operating system and the requests library for making requests to the Gladia API.
import requests
import os
# Set your API key here. Note that you can use an environment variable
# for better security and to avoid exposing your API key to the public
gladia_key = ''
Next, we define a Python function named audio_transcription with a parameter filepath of type string, which expects the path to the audio file to be transcribed. Inside this function, we also define a headers dictionary that holds the API key you set above.
def audio_transcription(filepath: str):
    # Define API key as a header
    headers = {'x-gladia-key': gladia_key}
In the following line of code, we use the os.path.splitext function to split the input filepath into a filename and a file extension. We do this because the audio entry of the API request needs the filename, the audio file, and the content type, and the content type is built from the file extension.
    # Split the filename and extension
    filename, file_ext = os.path.splitext(filepath)
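One caveat about deriving the content type from the extension: for an MP3 file this produces 'audio/mp3', whereas the registered MIME type is 'audio/mpeg'. Whether Gladia's API treats the two differently is not stated in this tutorial, so the following is only an optional sketch that uses Python's standard mimetypes module instead; the helper name guess_content_type is hypothetical.

```python
import mimetypes


def guess_content_type(filepath: str, default: str = "application/octet-stream") -> str:
    """Look up the MIME type for a file path, falling back to a generic default."""
    content_type, _encoding = mimetypes.guess_type(filepath)
    return content_type or default


print(guess_content_type("./podcast.mp3"))  # audio/mpeg
```

The returned string could be used as the third element of the audio tuple in the request in place of the extension-based value.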
To prepare the necessary data for making the API request, we define a dictionary with several keys. The audio key is used to specify the metadata for the audio file, which are the filename, audio file, and content type.
Because our audio file contains two speakers, we also set the toggle_diarization key to True and diarization_max_speakers to 2, so that the model recognizes no more than two speakers in the audio file.
Finally, we set output_format to 'txt' so that the API response contains the full transcription as a single block of text.
    with open(filepath, 'rb') as audio:
        # Prepare data for API request
        files = {
            'audio': (filename, audio, f'audio/{file_ext[1:]}'),  # Specify audio file type
            'toggle_diarization': (None, True),  # Toggle diarization option
            'diarization_max_speakers': (None, 2),  # Set the maximum number of speakers for diarization
            'output_format': (None, 'txt')  # Specify output format as text
        }
        print('Sending request to Gladia API')
        # Make a POST request to Gladia API while the file is still open
        response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)
    if response.status_code == 200:
        # If the request is successful, parse the JSON response
        response = response.json()
        # Extract the transcription from the response
        prediction = response['prediction']
        # Write the transcription to a text file
        with open('transcription.txt', 'w', encoding='utf-8') as f:
            f.write(prediction)
        return response
    else:
        # If the request fails, print an error message and return the JSON response
        print('Request failed')
        return response.json()
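The function above assumes that every response body is valid JSON and that a successful response always contains a prediction field. A slightly more defensive variant of the response handling could look like the sketch below. The helper name extract_prediction is hypothetical; it takes the status code and raw body as plain values so that it can be exercised without calling the API.

```python
import json


def extract_prediction(status_code: int, body: str) -> str:
    """Return the transcription text, or raise an error with a readable message."""
    if status_code != 200:
        # Include the start of the body, since error pages are not always JSON
        raise RuntimeError(f"Request failed with status {status_code}: {body[:200]}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise RuntimeError("Response body was not valid JSON") from exc
    prediction = payload.get("prediction") if isinstance(payload, dict) else None
    if not isinstance(prediction, str):
        raise RuntimeError("Response did not contain a text 'prediction' field")
    return prediction
```

Inside audio_transcription, it would be called as extract_prediction(response.status_code, response.text) before writing the result to transcription.txt.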
Next, we call the audio_transcription function with the path to the audio file that we want to transcribe. The audio file can be downloaded here.
As the code above shows, the function also automatically saves the full transcription to a text file named transcription.txt.
audio_transcription('./podcast.mp3')
After running the code that saves the transcription in a text file, we can observe that Gladia can accurately identify the speakers in the audio file as well as provide accurate transcriptions without the OpenAI Whisper hallucination problem. You can view the full transcription here.
Although an MP3 audio file was used in this tutorial, Gladia accepts a variety of other media formats, as well as URLs pointing to an audio or video file.
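For the URL case, the request would carry a link instead of the file itself. The sketch below only builds the multipart fields and makes no network call. The field name audio_url is an assumption that this tutorial does not confirm, so check Gladia's API reference before relying on it; the helper name build_url_request_fields is hypothetical, and the other fields mirror the ones used in main.py.

```python
def build_url_request_fields(audio_url: str, max_speakers: int = 2) -> dict:
    """Build multipart fields for transcribing a remote file (field name assumed)."""
    if not audio_url.startswith(("http://", "https://")):
        raise ValueError("audio_url must be an http(s) URL")
    return {
        'audio_url': (None, audio_url),  # Assumed field name, not confirmed here
        'toggle_diarization': (None, True),
        'diarization_max_speakers': (None, max_speakers),
        'output_format': (None, 'txt'),
    }


fields = build_url_request_fields('https://example.com/podcast.mp3')
print(sorted(fields))
```

The resulting dictionary would be passed as the files argument of requests.post, with the same headers and endpoint as in audio_transcription.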
Conclusion
The original Whisper model from OpenAI requires splitting audio larger than 25 MB into chunks, which often results in lower-quality transcriptions. At Gladia, we have optimized the Whisper model with newer features while increasing the audio file limit to 500 MB for a more seamless experience. Our latest model, Whisper-Zero, addresses usage limitations, improves accuracy across languages, and more.
This tutorial has demonstrated how to transcribe long audio files using the Gladia API. We generated an API key, defined the features we wanted the model to use, made a request to the API, and saved the transcript to a text file. If the generated transcript is too long to read, please refer to this tutorial, which teaches you how to summarize audio files using Whisper ASR and GPT-3.5.
Feel free to experiment with other features available with our API and customize the main.py file to suit your needs. To try Gladia, sign up directly below.