Speech to Text: Chapter 3 - Speech Recognition with Open Source

Audio Processing Jun 4, 2021


In the previous article, we saw how to use Google Cloud Platform for speech-to-text tasks with state-of-the-art accuracy.

In today's article, we will find out how to do speech to text on-premise. The primary goal is a solution that works offline (i.e. without internet connectivity).

Speech recognition research began in the early 1950s at Bell Labs. The initial implementations worked with only a single speaker and had a limited vocabulary of about a dozen words. Modern speech recognition systems have come a long way since then.

They can recognise speech from multiple speakers and have huge vocabularies in various languages. They can also auto-detect languages on the fly.

Why Open Source?

Open-source software (OSS) has clear benefits in today's technology landscape. The most popular mobile OS, Android, powers billions of smartphones, and it is built on the rock-solid open-source Linux kernel.

The OSS model not only promotes high-quality software but also incentivises the community to collaborate and contribute.

Consider this: if a single corporation had to develop technologically advanced software such as a speech recognition system on its own, it would need to spend millions of dollars to design, build and maintain it. OSS helps reduce those costs because of the community effort that goes into it.

Without further ado, let's get right to it and see how it's done.

Demo Application

In our demo application, we are going to use the VOSK framework, which provides state-of-the-art offline speech-to-text capabilities using deep learning under the hood. VOSK is an open-source speech recognition toolkit based on the Kaldi ASR project.

The highlights of using VOSK are:

  1. Supports 18 languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian. More to come.
  2. Works offline, even on lightweight devices - Raspberry Pi, Android, iOS
  3. Installation is simple - just run pip install vosk
  4. Portable per-language models are only 50 MB each, but much bigger server models are also available.
  5. Provides a streaming API for the best user experience (unlike popular speech-recognition Python packages)
  6. There are bindings for different programming languages, too - Java, C#, JavaScript etc.
  7. Allows quick reconfiguration of vocabulary for best accuracy.
  8. Supports speaker identification besides simple speech recognition.
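
The vocabulary reconfiguration mentioned in item 7 can be sketched as follows. VOSK's KaldiRecognizer accepts an optional third argument, a JSON list of allowed phrases, with "[unk]" standing in for anything outside that list. The helper names build_grammar and make_restricted_recognizer are hypothetical, introduced only for this illustration, and the restriction is assumed to apply to models that support dynamic vocabulary (typically the small ones).

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON list of phrases for the recognizer's optional grammar argument."""
    # Normalise to lower case and drop empty entries.
    cleaned = [p.strip().lower() for p in phrases if p.strip()]
    if allow_unknown:
        cleaned.append("[unk]")  # lets out-of-vocabulary audio map to an "unknown" token
    return json.dumps(cleaned)


def make_restricted_recognizer(model_dir, sample_rate, phrases):
    """Create a recognizer limited to `phrases` (hypothetical helper, needs `pip install vosk`)."""
    # Imported lazily so build_grammar can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    return KaldiRecognizer(Model(model_dir), sample_rate, build_grammar(phrases))


print(build_grammar(["Turn On", "turn off "]))
# prints ["turn on", "turn off", "[unk]"]
```

A short command list like this one is where a restricted vocabulary tends to help most, for example voice control on an edge device.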

Setup Instructions

VOSK-api comes with various sample scripts to test out its ASR capabilities. Follow the instructions below to set it up on your local machine -

git clone https://github.com/alphacep/vosk-api && cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model
python ./test_simple.py test.wav # or <PATH TO *.wav FILE>

Output -

Now that you have cloned the Git repo, how about we go ahead and test out its capabilities?

Step 1 - Import required libraries

from vosk import Model, KaldiRecognizer, SetLogLevel # vosk 
import sys # for fetching arg from CLI
import os # os operations
import wave # handle audio wav files

SetLogLevel(0) # debug level 0 for output on terminal

Step 2 - Check whether the model folder exists in the current path

if not os.path.exists("model"):
    # Check whether the appropriate language model exists
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    sys.exit(1)  # stop here, the recognizer cannot run without a model

Step 3 - Load the pre-trained VOSK model and initialise it

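wf = wave.open(sys.argv[1], "rb")  # open the WAV file passed on the command line
model = Model("model")  # ASR (Automatic Speech Recognition) VOSK model is initialised
rec = KaldiRecognizer(model, wf.getframerate())  # recognizer uses the file's sample rate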
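
The recognizer above takes its sample rate from the opened WAV file, and the sample script in the VOSK repository additionally expects mono, 16-bit, uncompressed PCM audio. A minimal sketch of that check, using only the standard library, could look like this. The function name check_wav is hypothetical and not part of VOSK.

```python
import wave


def check_wav(path):
    """Return the sample rate if `path` is mono 16-bit PCM, else raise ValueError."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            raise ValueError("expected mono audio, got %d channels" % wf.getnchannels())
        if wf.getsampwidth() != 2:  # sample width is reported in bytes
            raise ValueError("expected 16-bit samples")
        if wf.getcomptype() != "NONE":
            raise ValueError("expected uncompressed PCM")
        return wf.getframerate()
```

Running a check like this before starting recognition gives a clear error for unsupported files instead of a poor or empty transcript.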

Step 4 - Run recognition on the WAV file

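while True:
    # Loop runs until the end of the WAV file; output is printed on the screen.
    data = wf.readframes(4000)
    if len(data) == 0:  # no more frames left to read
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())  # a complete utterance was recognised
    else:
        print(rec.PartialResult())  # interim hypothesis for the current utterance
print(rec.FinalResult())  # flush whatever is left after the last chunk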

Step 5 - Enjoy your Results ...
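
The recognizer returns its results as JSON strings: final results carry a "text" field and partial results carry a "partial" field. A small sketch of turning those strings into a plain transcript is shown below. The helper names and the sample payloads are made up for illustration and are not real recognizer output.

```python
import json


def extract_text(result_json):
    """Return the recognised text from a final or partial result JSON string."""
    payload = json.loads(result_json)
    # Final results carry "text"; partial results carry "partial".
    return payload.get("text", payload.get("partial", ""))


def join_utterances(result_jsons):
    """Join the non-empty texts of several results into one transcript."""
    texts = (extract_text(r) for r in result_jsons)
    return " ".join(t for t in texts if t)


# Illustrative payloads shaped like recognizer output (not real results).
samples = ['{"text": "hello world"}', '{"text": ""}', '{"text": "how are you"}']
print(join_utterances(samples))
# prints hello world how are you
```

Collecting the final results this way gives one transcript per file, while partial results are better suited to live captions that update as the speaker talks.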


Use cases

Such a system has immense potential in real-world implementations such as -

  • IoT or edge devices - An IoT project can have tens of thousands of edge devices running in a network. If an open-source solution suits your use case, it can bring a lot of benefits, from better latency to lower costs.
  • Reducing latency - Calling cloud APIs, especially for streaming data, can add latency as the servers are usually physically far away. Running the package on-premise can reduce latency significantly.
  • Lowering costs - Since no third-party cloud-based ASR provider is involved, there are no additional costs for speech-to-text tasks.
  • Data privacy - If you have a use case involving sensitive data which you can't afford to share, open-source ASR is a good choice. Data stays within the network, and privacy is safeguarded as the network owner is in total control.

Closing thoughts

We saw how easy it is to set up an Automatic Speech Recognition system using the VOSK toolkit on our local machine.

VOSK provides different models based on language and dialect. We can pick one as per our requirements or intended target audience. There is also a server implementation by the VOSK team that should help you get started with building chatbots or custom ASR systems.

Cloud-based ASR solutions from platforms like GCP, AWS and Azure can additionally provide out-of-the-box features such as a large pool of language models, higher accuracy etc., but they would bump up your costs.

The choice of approach depends purely on what your business is aligned to and what the future roadmap of your organisation is.


STAY TUNED for more such content. 😁


Nikhil Akki

Full Stack AI Tinkerer
