In the previous posts, we covered the intuition and inner workings of Speech-to-Text (STT). There are many tools in the ecosystem to choose from, ranging from open source projects to cloud-based services. In this article, we will focus on the Speech-to-Text service of Google Cloud Platform (GCP). As we all know, Google is the market leader in search and, with Android, in mobile operating systems.
So much data is generated in Google's ecosystem that, over the years, it has developed and pioneered a lot of innovation in the space of AI, especially in Speech, Text and Computer Vision, among other areas.
Why Google Cloud Platform?
GCP is one of the leading players among cloud-based Speech-to-Text services. Thanks to its mobile computing businesses and strong linguistic capabilities, Google has developed superior AI capabilities. The first 60 minutes of transcription each month are free, and GCP is one of the few platforms that gives you a 12-month trial period along with $300 in credit, which you can use on the various services it offers.
To get started with creating an account on GCP, please click here.
Why not just go for Open Source?
You may be wondering why you should go for the cloud when open source alternatives are available. The answer is straightforward: if you need higher accuracy in your speech-to-text conversion, or support for various dialects or even multiple languages, the cloud is the best place to get started.
There are two main reasons -
- Works out of the box -
All you need to do is call the right APIs from the cloud provider, and the Speech-to-Text capabilities are up and running.
- Multi-dialect and language support -
The fee cloud providers charge covers more than hosting and server costs. Their Speech-to-Text models are trained on very large amounts of data that is not publicly available (at least not for free), and that training is what makes broad dialect and language coverage possible.
Comparisons with other cloud solutions for Speech-to-text
Pro tip - Use the pricing calculator whenever you use a service from a cloud provider; it helps you get a sense of the potential costs.
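For a rough feel of what a pricing calculator does, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_monthly_cost` and the per-minute rate passed to it are hypothetical, and the 60 free minutes are the allowance mentioned earlier in this post. Always check the provider's current pricing page for real numbers.

```python
def estimate_monthly_cost(
    minutes: float, rate_per_minute: float, free_minutes: float = 60.0
) -> float:
    """Estimate a monthly transcription bill.

    rate_per_minute is a placeholder you supply yourself; it is NOT
    an official GCP price.
    """
    # Only the minutes beyond the free allowance are billed.
    billable_minutes = max(0.0, minutes - free_minutes)
    return round(billable_minutes * rate_per_minute, 2)


# Usage within the free allowance costs nothing.
print(estimate_monthly_cost(minutes=30, rate_per_minute=0.02))
# 160 minutes with a made-up rate of 0.02 per minute bills 100 minutes.
print(estimate_monthly_cost(minutes=160, rate_per_minute=0.02))
```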
Today we are going to build an app that will leverage GCP's speech-to-text APIs for audio transcription.
The app takes a URI as input, and the audio file has to be stored in GCP's Cloud Storage. The service runs on the cloud and returns the audio transcription as a response. The app prints the output in the result box. This UI layer is just a wrapper over the APIs, so the same code can run in any other server application.
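Since the input has to point at a file in Cloud Storage, it can help to validate the URI before sending it to the service. The sketch below is not part of the original app: `parse_gcs_uri` is a hypothetical helper, and it assumes the usual `gs://bucket/object` form of Cloud Storage URIs.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs://bucket/object URI into (bucket, object_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs":
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    bucket = parsed.netloc
    object_name = parsed.path.lstrip("/")
    if not bucket or not object_name:
        raise ValueError(f"URI needs both a bucket and an object name: {uri!r}")
    return bucket, object_name


# The bucket and file names here are made up for illustration.
bucket, object_name = parse_gcs_uri("gs://my-bucket/audio/sample.wav")
print(bucket, object_name)
```

Rejecting malformed input early gives the user a clear error message instead of a failure coming back from the remote service.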
Code walk-through
Speech-to-text code
    from google.cloud import speech


    class GoogleSTT:
        def __init__(
            self, sample_rate_hertz: int = 16000, language_code: str = "en-US"
        ) -> None:
            # Instantiate a client (uses the credentials configured for GCP)
            self.client = speech.SpeechClient()
            self.sample_rate_hertz = sample_rate_hertz
            self.language_code = language_code

        def run_stt(self, gcs_uri: str):
            # Reference the audio file stored in Cloud Storage
            self.audio = speech.RecognitionAudio(uri=gcs_uri)
            self.config = speech.RecognitionConfig(
                encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                sample_rate_hertz=self.sample_rate_hertz,
                language_code=self.language_code,
            )
            # Detect speech in the audio file
            response = self.client.recognize(config=self.config, audio=self.audio)
            return response

        def get_transcript(self, response) -> str:
            # Each result holds a list of alternatives; the first one is the most likely
            transcript = [
                result.alternatives[0].transcript for result in response.results
            ]
            return " ".join(transcript)

        def transcribe(self, gcs_uri: str) -> str:
            response = self.run_stt(gcs_uri)
            transcript = self.get_transcript(response)
            return transcript
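To make the shape of the response easier to picture, here is a small sketch that runs without a GCP account. It uses stand-in objects built with `SimpleNamespace` rather than real API output, and `top_transcript` is a hypothetical helper. The one thing it illustrates is that each result carries a list of alternatives, ranked with the most likely one first, so the transcript is read from index 0.

```python
from types import SimpleNamespace


def top_transcript(response) -> str:
    """Join the highest-ranked alternative of every result into one string."""
    pieces = [
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives  # skip results that carry no alternatives
    ]
    return " ".join(piece.strip() for piece in pieces)


# Stand-in shaped like a recognition response (made-up data, not API output).
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            alternatives=[SimpleNamespace(transcript="hello world", confidence=0.9)]
        ),
        SimpleNamespace(
            alternatives=[SimpleNamespace(transcript=" how are you", confidence=0.8)]
        ),
    ]
)

print(top_transcript(fake_response))  # prints: hello world how are you
```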
For detailed instructions on running the above code, please visit our GitHub page.
We saw how easy it is to bring Speech-to-Text capabilities to your app. In the above example, we built a simple app that takes the URI of an audio file as input and displays the transcribed text on the screen. In the upcoming post, we will explore other avenues available to us for Speech-to-Text tasks.