Speech to Text: Chapter 2 - Google Cloud Platform

Audio Processing Jun 2, 2021

Introduction

In the previous posts, we covered the intuition and inner workings of Speech-to-Text (STT). There are different tools in the ecosystem to choose from, ranging from open-source projects to cloud-based services. In this article, we will focus on the Speech-to-Text service offered by Google Cloud Platform (GCP). As we all know, Google is the market leader in search engines and, with Android, in mobile operating systems.

Google's ecosystem generates so much data that, over the years, the company has pioneered a lot of innovation in AI, especially in speech, text, and computer vision, among other areas.

Why Google Cloud Platform?

GCP is one of the leading players among cloud-hosted Speech-to-Text services. Thanks to its mobile computing business and strong linguistic capabilities, Google has developed superior AI capabilities. The first 60 minutes of transcription each month are free, and GCP is one of the few platforms that gives new accounts a free trial along with $300 in credit, which you can use on the various services it offers. Check GCP's free-trial terms for the current trial length.

To get started, create an account on GCP.

Why not just go for Open Source?

You may be wondering why you should go for the cloud when open-source alternatives are available. The answer is straightforward: if you need higher accuracy in your speech-to-text conversion, or support for various dialects or even multiple languages, the best place to get started is the cloud.

There are two main reasons:

  • Works out of the box -
All you need to do is call the right APIs from the cloud provider, and the Speech-to-Text capabilities are up and running.
  • Multi-dialect and language support -
Cloud providers charge a fee not just for hosting and server costs. A lot of AI model training is also involved, using large amounts of data that isn't publicly available (at least not for free) to train the Speech-to-Text model.

Comparisons with other cloud solutions for Speech-to-text

Visit the pricing page of each cloud provider for the latest pricing.
Pro tip - Use the pricing calculator whenever you use a service from a cloud provider; it helps you get a sense of the potential costs.
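
As a rough illustration of the arithmetic a pricing calculator performs, here is a minimal sketch. The function name and the per-minute rate are assumptions made up for this example, not quoted prices; the 60 free minutes per month come from the free tier mentioned earlier. Always take the real rate from the provider's pricing page.

```python
def estimate_monthly_cost(
    audio_minutes: float,
    price_per_minute: float,
    free_minutes: float = 60.0,
) -> float:
    """Estimate a monthly bill: only minutes beyond the free tier are charged.

    price_per_minute is a placeholder supplied by the caller; look up the
    real rate on the provider's pricing page.
    """
    if audio_minutes < 0 or price_per_minute < 0 or free_minutes < 0:
        raise ValueError("inputs must be non-negative")
    billable_minutes = max(0.0, audio_minutes - free_minutes)
    return round(billable_minutes * price_per_minute, 2)


if __name__ == "__main__":
    # 0.02 per minute is an illustrative number, not an actual price.
    print(estimate_monthly_cost(audio_minutes=500, price_per_minute=0.02))
```

Real bills can differ because providers apply their own rounding rules and model-specific rates, so treat this only as a first approximation.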

Demo App

Today we are going to build an app that will leverage GCP's speech-to-text APIs for audio transcription.

Built with magicgui in Python


The app takes a URI as input, and the audio file has to be stored in GCP's Cloud Storage. The service runs in the cloud and returns the transcription as its response, which the app prints in the result box. This UI layer is just a wrapper over the APIs; the same code can run inside any other server application.
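
Since the app expects a Cloud Storage URI, it can help to validate the input before sending a request. The sketch below is one possible way to do that with the standard library only; the helper name `parse_gcs_uri` and the bucket and file names are hypothetical and not part of the demo app.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/file' into (bucket, object_name)."""
    parsed = urlparse(uri)
    object_name = parsed.path.lstrip("/")
    # A usable URI needs the gs scheme, a bucket, and an object path.
    if parsed.scheme != "gs" or not parsed.netloc or not object_name:
        raise ValueError(f"expected gs://<bucket>/<object>, got {uri!r}")
    return parsed.netloc, object_name


if __name__ == "__main__":
    # Hypothetical bucket and file names, used only for illustration.
    print(parse_gcs_uri("gs://my-bucket/audio/sample.wav"))
```

Rejecting a malformed URI up front gives the user a clearer error than a failed API call would.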

Code walk-through

Speech to text code

from google.cloud import speech


class GoogleSTT:
    def __init__(
        self, sample_rate_hertz: int = 16000, language_code: str = "en-US"
    ) -> None:

        # Instantiates a client
        self.client = speech.SpeechClient()
        self.sample_rate_hertz = sample_rate_hertz
        self.language_code = language_code

    def run_stt(self, gcs_uri: str):
        # Reference the audio file by its Cloud Storage URI (gs://bucket/object)
        self.audio = speech.RecognitionAudio(uri=gcs_uri)

        # The encoding and sample rate must match the audio file itself
        self.config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=self.sample_rate_hertz,
            language_code=self.language_code,
        )

        # Synchronous request, meant for short audio (about a minute or less);
        # longer files need long_running_recognize() instead
        response = self.client.recognize(config=self.config, audio=self.audio)
        return response

    def get_transcript(self, response):
        # Each result covers one portion of the audio; keep its top alternative
        transcript = [result.alternatives[0].transcript for result in response.results]
        return " ".join(transcript)

    def transcribe(self, gcs_uri: str):
        response = self.run_stt(gcs_uri)
        transcript = self.get_transcript(response)
        return transcript
  
For detailed instructions on running the above code, please visit our GitHub page.
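
To make the shape of the response easier to picture without cloud credentials, the sketch below mimics it with plain stand-in objects. The `fake_response` and its text are made up for illustration and only copy the `results` / `alternatives` / `transcript` attributes used by `get_transcript` above; `join_transcripts` is a hypothetical variant that also skips results with no alternatives.

```python
from types import SimpleNamespace


def join_transcripts(response) -> str:
    """Join the top alternative of every result, skipping empty results."""
    parts = [
        result.alternatives[0].transcript.strip()
        for result in response.results
        if result.alternatives
    ]
    return " ".join(parts)


# Stand-in for a recognize() response; the text is invented for this example.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello world")]),
        SimpleNamespace(alternatives=[]),
        SimpleNamespace(alternatives=[SimpleNamespace(transcript=" how are you")]),
    ]
)

if __name__ == "__main__":
    print(join_transcripts(fake_response))
```

Stripping each piece before joining avoids doubled spaces when a transcript segment carries its own leading or trailing whitespace.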

The GoogleSTT class implements a few helper methods that invoke the Speech-to-Text API through the speech package of Google's Python client library. The service supports various audio formats such as raw, WAV, FLAC, and MP3. Make sure that the sampling rate and encoding in the request match the input file. The speech package takes care of all the internal nitty-gritty. Client libraries are also available for other programming languages such as JavaScript and Go.
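
One way to check the sampling rate before uploading is to read it from the file header. The sketch below uses Python's built-in wave module and assumes an uncompressed PCM WAV file. The helper names are hypothetical, and the silent clip is generated in memory only so that the example runs without a real recording.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read sample rate, channel count, and bit depth from a WAV header."""
    with wave.open(source, "rb") as wav_file:
        return {
            "sample_rate_hertz": wav_file.getframerate(),
            "channels": wav_file.getnchannels(),
            "bits_per_sample": wav_file.getsampwidth() * 8,
        }


def make_silent_wav(sample_rate: int = 16000, seconds: float = 0.1) -> io.BytesIO:
    """Build a short mono 16-bit WAV in memory so the example runs anywhere."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    info = describe_wav(make_silent_wav())
    print(info)
    # LINEAR16 means 16-bit PCM, so the bit depth should be 16.
    print("matches LINEAR16 at 16 kHz:",
          info["bits_per_sample"] == 16 and info["sample_rate_hertz"] == 16000)
```

`describe_wav` also accepts a file path, so the reported `sample_rate_hertz` can be passed to the GoogleSTT constructor instead of relying on the 16000 default.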

Conclusion

We saw how easy it is to bring Speech-to-Text capabilities to your app. In the above example, we made a simple app that took the URI of an audio file as input and displayed the text result on the screen. In the upcoming post, we will explore the other avenues available to us for Speech-to-Text tasks.

Nikhil Akki

Full Stack AI Tinkerer
