Thinking of using open-source Whisper ASR? Here are the main factors to consider
Published on Mar 2024
Perhaps you're a developer looking for an Automatic Speech Recognition (ASR) solution for the first time. Or an executive seeking more affordable, faster, and more accurate alternatives to the mainstream speech-to-text solutions for your business. Where do you turn?
Generally, you've got two options: build a solution in-house using an open-source model like OpenAI's Whisper ASR, or pick a specialized speech-to-text API provider.
In this blog, we’ll compare the pros and cons of each approach, and provide you with a hands-on guide on how to make the best decision for your project and use case.
Bonus: a handy open source vs. API cheat sheet at the end!
Benefits of using OpenAI Whisper to develop your own ASR solution
The open source revolution in AI
The availability of open source code has been a major catalyst for the adoption of AI. Only a few years ago, it took a team of in-house specialists and significant computing resources to train AI models for relatively basic tasks. These days, engineers can simply turn to open source platforms such as Hugging Face or GitHub and find the code they need to start building.
This is also true for ASR applications. Take Mozilla DeepSpeech, for instance, an open source ASR system available on GitHub that has been used for transcription services and voice assistants; or Kaldi, a widely used open source toolkit for speech recognition that provides a flexible and modular framework for building ASR systems. Kaldi has been used in many research projects and has also been adopted by several commercial speech recognition systems. Other popular open source tools include CMU Sphinx, Wav2Letter++, and Julius.
Why Whisper
Of all the above, open-source Whisper stands out as one of the biggest breakthroughs in the field. Released by OpenAI in 2022, it has gained significant attention for its accuracy and versatility in speech recognition.
Its deep learning-based approach, built on a Transformer encoder-decoder trained with sequence-to-sequence learning on 680,000 hours of multilingual audio, has opened up a world of possibilities for commercial applications and use cases that rely on ASR technology, including in the previously uncharted multilingual domain.
Whisper empowers developers to create a diverse array of voice-enabled applications, ranging from transcription services and virtual assistants to hands-free controls and speech analytics.
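To give a sense of how little code it takes to get started, here is a minimal sketch using the open-source whisper Python package (the file names are illustrative):

```python
# pip install openai-whisper  (ffmpeg must also be installed on the system)
import whisper

# Load a pretrained checkpoint; "base" is small enough to try on a CPU.
model = whisper.load_model("base")

# Transcribe a local, pre-recorded audio file (path is illustrative).
result = model.transcribe("meeting_recording.mp3")
print(result["text"])

# Whisper can also translate speech, but only *into* English:
translated = model.transcribe("entretien_fr.mp3", task="translate")
print(translated["text"])
```

That accessibility is a big part of Whisper's appeal; the harder part, as we'll see, is turning a demo like this into a production-grade system.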
Its open nature encourages collaboration within the developer community, promoting rapid innovation and customization to suit specific project requirements. Moreover, self-hosting is the only approach enabling full security and control over one’s data and infrastructure.
Based on an internal survey of 225 product owners, CTOs, and CPOs conducted earlier this year, we found that about 40% opt for open source models, predominantly Whisper, for their STT solutions, which testifies to the clear value of this route for a range of business applications.
Yet, finding source code is only the start of the journey. Adapting it to your specific use case is a whole different story, as it often requires additional fine-tuning and optimization. For companies that don't have the time or resources to achieve this, relying on open source alone may not be the ideal move.
This is where the notion of the total cost of ownership comes in handy when deciding whether developing an ASR solution in-house using open source code is the right decision for your business and use case.
Limitations of Whisper ASR
Depending on their needs and use case, companies need to determine whether they have sufficient in-house AI and ML expertise to set up and maintain a model like Whisper in the long run. Otherwise, they risk starting something they cannot build, scale or fine-tune well enough to match their needs.
In a nutshell, there are three main limitations when it comes to building an in-house ASR solution using open source models such as Whisper.
Open source models are limited. Open source models, however groundbreaking, can be quite inflexible. To adapt them to a specific use case, additional fine-tuning and optimization are needed. For instance, Whisper's multilingual abilities do not extend equally to all languages and features, and translation is limited to any-language-to-English. Overcoming this and other shortcomings requires the use of proprietary algorithms and/or additional open source models.
Open source gets problematic at scale. Setting up and maintaining a neural network like Whisper at scale requires significant hardware resources. Whisper's highest quality model, Large-v2, is highly intensive in both GPU and memory usage, not to mention the degree of data science and engineering expertise required to make it production-grade, which goes far beyond what is needed to train simpler machine learning models.
Open source can be very costly. While the cost of running CPUs and GPUs is relatively affordable (from around $0.20 per hour), there's much more that goes into building your own ASR solution in-house using open source software. Once you add the cost of human capital (hiring at least two senior programmers, a data scientist, and a project manager) to the hardware and maintenance costs of self-hosting, your total cost of ownership (TCO) can easily add up to $300k to $2 million per year. Here's how we estimated that.
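As a back-of-envelope sketch of that arithmetic (the GPU rate and developer salary range come from the figures quoted in this post; the data scientist and project manager salaries are assumptions):

```python
# Rough annual TCO sketch for self-hosting an ASR model.
# All figures are illustrative ranges, not precise estimates.

gpu_rate = 0.20                    # $/hour, entry-level cloud GPU (quoted above)
gpu_hours = 24 * 365               # one always-on inference node
hardware = gpu_rate * gpu_hours    # ~ $1,752 per GPU per year

senior_dev = 88_000                # upper end of the $50k-$88k range cited below
team = 2 * senior_dev + 110_000 + 90_000  # + data scientist + PM (assumed salaries)

total = hardware + team
print(f"Hardware (1 GPU): ${hardware:,.0f}/year")
print(f"Team:             ${team:,.0f}/year")
print(f"Rough TCO:        ${total:,.0f}/year")
# Scale the GPU fleet, add storage, networking, on-call rotations and
# periodic retraining, and the total quickly lands in the $300k-$2M range.
```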
So is Whisper worth it?
Yes and no. It all depends on your specific use case and business needs.
Do you simply want to quickly create a product demo or conduct research on AI speech recognition? Then Whisper might be perfectly adequate for you.
However, for more complex use cases, such as call centers offering customer support or global media companies transcribing large volumes of content, hosting Whisper may not be the best option, as you’ll need to divert significant engineering resources, product staff and overall focus away from your primary product to build the extra functionalities needed.
For instance, Whisper is only able to process pre-recorded audio, so if you need real-time speech processing, you'll need to devote a lot of resources to optimizing it. The model also requires developers to split audio files into smaller chunks once they exceed 25 MB, which can be quite a hassle and degrades quality. Beyond the most popular languages, its performance is limited and requires custom fine-tuning; the same goes for industry-specific jargon.
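To make that constraint concrete, here is a rough sketch of the kind of chunking logic you would have to write yourself, using the pydub library (chunk length and file names are illustrative):

```python
# pip install pydub  (requires ffmpeg)
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks, in milliseconds (illustrative)

# Load a long recording and slice it into fixed-length pieces.
audio = AudioSegment.from_file("long_interview.mp3")
for i in range(0, len(audio), CHUNK_MS):
    chunk = audio[i:i + CHUNK_MS]
    chunk.export(f"chunk_{i // CHUNK_MS:03d}.mp3", format="mp3")

# Note: naive fixed-length cuts can split words mid-sentence, which is
# one reason this kind of chunking degrades transcription quality.
```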
So while open source models often seem cost-effective at the outset, hosting one like Whisper can easily end up being costly at scale once you take into account the TCO, as well as its inherent 'design limitations'.
Any company with 100+ hours of audio to transcribe per month would quickly begin to suffer under the financial burden of footing the bill for an in-house team of experts, plus increased GPU usage.
Alternative route: Getting your ASR solution through an API
There is an alternative to going the open source route, namely: picking an API provider.
What are speech-to-text APIs?
Speech-to-text APIs are cloud-based services that provide developers with pre-built tools and interfaces to convert spoken language (audio or video) into written text. These APIs offer a convenient way to integrate AI speech recognition capabilities into your apps and platforms without the need to develop and maintain an ASR system from scratch. In a nutshell, it's a batteries-included deal.
Speech-to-text APIs work by leveraging machine learning algorithms and large-scale training data to recognize and transcribe spoken words.
They typically employ a combination of traditional and deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based models, to process audio input and generate text output, as well as to perform more advanced functions like summarization or sentiment analysis, which often require generative AI models.
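In practice, integrating such a service usually boils down to a single HTTP call. The sketch below is purely hypothetical: the endpoint, parameters, and response fields are placeholders, not any particular vendor's API:

```python
import requests

API_URL = "https://api.example-stt.com/v1/transcribe"  # placeholder endpoint
API_KEY = "your-api-key"                               # placeholder credential

# Upload a local audio file and request a transcript with extras.
with open("meeting_recording.mp3", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
        data={"language": "en", "diarization": "true"},  # hypothetical parameters
    )

response.raise_for_status()
print(response.json()["transcript"])  # hypothetical response field
```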
Benefits of APIs in ASR
Superior performance thanks to proprietary optimization
When discussing the benefits of APIs, we need to address the speed-quality tradeoff in ASR. While a large neural network may be extremely accurate, it also takes longer to compute: the more computations required, the more processing time (and GPU power) is needed. Conversely, a simple algorithm may bring quick results but suffer from low accuracy.
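You can observe this tradeoff for yourself by timing two differently sized Whisper checkpoints on the same file (a sketch assuming the open-source whisper package and a local audio file):

```python
import time
import whisper

# Compare a small and a larger checkpoint on the same recording.
for size in ("tiny", "small"):
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("sample.mp3")  # illustrative file
    elapsed = time.perf_counter() - start
    print(f"{size:>5}: {elapsed:5.1f}s  {result['text'][:60]}...")
```

The larger checkpoint will generally produce a more accurate transcript, but at noticeably higher latency and GPU cost.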
By using a hybrid architecture that combines the best of both worlds, APIs strike a balance between speed and quality while offering customizable options. That way, API providers can offer cost-effective solutions that cater to a wide range of user requirements.
Thanks to a proprietary approach to model optimization, Gladia is among the commercial vendors that have significantly improved Whisper’s base performance, and made it accessible to companies with high volumes of audio at a fraction of the cost.
More specifically, at Gladia we were able to achieve superior performance for Whisper in terms of both latency and accuracy, increase the volume and variety of input and output files and sizes, and expand the model's scope from base transcription to audio intelligence add-ons, as well as translation to languages other than English. Our latest ASR system, Whisper-Zero, achieves all that while eliminating virtually all of the base model's hallucinations.
No AI expertise or infrastructure required
Unlike building an ASR solution in-house using open source code, APIs are easy to use: developers without AI expertise can access ready-to-use services with simple API calls, eliminating the need to delve into the intricacies of speech recognition algorithms and infrastructure setup.
Moreover, they can be scaled more easily, since APIs are hosted in the cloud and can handle a high volume of requests, allowing applications to scale effortlessly.
Reduced time-to-market
Note that it will take you roughly a year to be production-ready if you choose to build a holistic audio AI solution in-house. That's one year during which your competitors are launching offers and winning customers, putting them at a competitive advantage. With an API, you can derive value from AI-powered features from day one of implementation.
Technical updates
You also need to factor the cost of future updates into your overall budget. Given the current pace of AI, a new model will become obsolete in less than three years, requiring additional capital reinjection in terms of both software and hardware.
With APIs, however, when it comes to maintaining and updating the ASR solution — including model improvements, bug fixes, and feature enhancements — this is all taken care of by the provider, freeing up valuable time and resources among your in-house developers.
Moreover, thanks to their extensive training data, ASR APIs often support multiple languages and dialects out of the box, eliminating the need for additional language-specific training.
Overall cost
Hosting any type of advanced speech-to-text solution yourself can be a lot more costly than opting for a pre-packaged API. The key reason: the cost of human capital.
After all, proper hosting requires at least two senior software developers, with salaries ranging from $50,000 to $88,000 per year. More realistically, it will take a 'two-pizza team', including a data scientist and a project manager, to sustain a full-scale operation. On top of that, self-hosting comes with a range of hardware and maintenance costs (full breakdown here).
In contrast, the pay-as-you-go formulas offered by API providers can be significantly cheaper. Based on our research, commercial pricing starts at $0.26 per hour of transcription and goes all the way up to $1.44 for Big Tech providers.
While the degree of quality varies greatly with each provider, APIs are generally more effective when you’re looking to easily scale your transcription volume and reduce your time to market.
All in all, APIs offer several benefits for companies that lack hardware and/or AI expertise, but still want to embed audio AI features into their product. Having an external vendor doing the pre-integration for you will save you time and money, allowing you to focus on delivering value from day one.
Open Source vs API: Ultimate Comparison
Whether to go the open source route or pick an API provider ultimately boils down to four key factors: available budget, level of in-house expertise, security requirements, and the volume of audio and video data you need to transcribe.
If you’re currently deciding between building your own in-house ASR solution or purchasing an API, here’s a cheat sheet we made with the main pros and cons of each approach.
API

Pros:
Ease of use: Any developer can use an API without needing any additional AI expertise – saving you a lot in HR costs. AI doesn’t need to be central to your business for you to still harness the benefits of the underlying technology.
All batteries included: APIs tend to be powered by multiple highly advanced models (often the best on the market) that are preselected, optimized for specific use cases, and updated regularly, enabling you to achieve optimal results.
Market-ready: Time-to-market is minimal with plug-and-play solutions. In other words, as an API user, you are able to supercharge your product with AI capabilities in a matter of minutes and derive value from it straight away.
Scalability: With APIs, you don’t need to commit to a specific volume of audio data in advance. You can scale at a reasonable cost as you grow, since the overall load is shared by multiple users.
Speed of use: AI APIs are designed to deliver speedy outputs, making them suitable for real-time use.
Cons:

Data privacy: When it comes to privacy, in-cloud API hosting may not be suitable for high-confidentiality use cases. With every external provider, there is always some risk of security breaches in the absence of due diligence.
High cost: While it is true that the absence of hardware requirements is good news for your budget, the STT API market doesn’t always strike the right quality-price balance for enterprise-grade clients, resulting in high costs for some use cases.
Dependency: With an API, you rely on an external provider; building in-house is a better option for those who do not wish to do so.
Open source

Pros:
Large open source database: The open source community has built impressive libraries offering many resources.
Potentially lower cost: In some cases, running a smaller model that is more limited in its application may allow for high performance for a specific use case at much lower cost than using a very large model provided as a service.
Full control: By running and maintaining open source models, organizations are not dependent on a third-party API service. Especially relevant when hosting is offline.
Cons:

Complexity: A certain degree of AI expertise is needed to deploy open source in-house, combined with a sufficiently robust IT infrastructure to support it.
Narrower performance: Because commercial APIs tend to be powered by several, optimized models (incl. LLMs and generative AI models), replicating the same in-house with open source models — which tend to be smaller and more narrow in their scope — can be challenging.
Shorter life cycle: With open source, updates aren't handled for you, so you need to be ready to upgrade software and hardware every 2-3 years.
Significant TCO: All things considered, hosting your own model(s) is associated with high CAPEX, HR and maintenance costs.
Longer time-to-market: Since it takes about a year to be up and running, there is a significant opportunity cost associated with open source.
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features, including speaker diarization and word-level timestamps.
Contact us