Speech to Text: Chapter 1

Audio Processing May 31, 2021
Let me ask you a question: what do you think is the happiest moment for a mother?
It's when her baby calls out to her for the first time.

Audio, speech, and vocals are among the most fundamental ways in which most living organisms respond to a stimulus. That being said, understanding speech is still one of the most arduous tasks for any digital being, or, as it is known in our world, Artificial Intelligence.

In recent years AI has taken leaps into the future, not only in understanding human speech but also in performing actions based on the commands and context of that speech. If one were to try such a feat from scratch, they would have to go through many steps of AI engineering, starting with the question: what is Speech to Text?

What is Speech to Text?

In simple words, Speech to Text means converting audio into text in a human-readable format.

But that's just a layman's definition. If you are reading this article, you are looking to LEVEL UP your game and be more than just a beginner. So then, what is speech-to-text?

At an overarching level the above definition holds true, but there are still many nuances involved in speech-to-text, a.k.a. STT. For example:

  1. Audio Transcription
  2. Speaker Diarization
  3. Language Translation
  4. Speech Correction
  5. Audio Contextualisation

And many more ...

As you can clearly see, STT is not only about converting audio to text but also about understanding the user's context and language, correcting the speech, and most importantly being able to identify the speaker's voice and separate it from background noise.
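To make these nuances a little more concrete, here is a minimal sketch of what the output of an STT pipeline could look like once transcription and speaker diarization are combined. The `Segment` class, its fields, and the `render_transcript` helper are all hypothetical names made up for illustration; they are not the schema of Google Speech-to-Text or of any other service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One stretch of speech attributed to a single speaker (hypothetical schema)."""
    speaker: str    # label assigned by diarization, e.g. "Speaker 1"
    start: float    # start time in seconds
    end: float      # end time in seconds
    text: str       # transcribed (and possibly corrected) text
    language: str   # language code, e.g. "en"


def render_transcript(segments: List[Segment]) -> str:
    """Order segments by start time and merge consecutive ones from the same speaker."""
    lines: List[str] = []
    previous_speaker = None
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker == previous_speaker:
            # Same speaker keeps talking: append to the current line.
            lines[-1] += " " + seg.text
        else:
            # New speaker: start a new line stamped with the segment start time.
            lines.append(f"[{seg.start:.1f}s] {seg.speaker}: {seg.text}")
            previous_speaker = seg.speaker
    return "\n".join(lines)


# Made-up sample data, only to show the shape of the result.
segments = [
    Segment("Speaker 1", 0.0, 1.8, "Hello, can you hear me?", "en"),
    Segment("Speaker 2", 2.0, 3.1, "Yes, loud and clear.", "en"),
    Segment("Speaker 2", 3.2, 4.0, "Go ahead.", "en"),
]
print(render_transcript(segments))
```

Keeping the speaker, timing, and language next to each piece of text is one possible design; it lets later steps such as translation or correction work on a single segment at a time.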

How can we achieve it?

As a rule, we know that any AI-based problem requires large volumes of data to train a model that achieves satisfactory results. But that's not what this series is all about.

We will be exploring the various avenues available to us for leveraging cloud-native and open-source platforms, not just to perform speech-to-text but also to explore crowd-sourced data along with annotations.

So without wasting much time let us dive into the Building Blocks of an STT based application.

Let's see the MAGIC

How about we take a sample, upload it to Google Speech-to-Text, and see what it transcribes? You must be wondering: but I didn't train it, and I don't have the credits to use the service.

Don't worry, Google provides a free interactive dashboard to test out their services, within a LIMITED capacity of course.

Now just go ahead and open Google Speech-to-text.


Now that we have a clearer understanding of what is needed to perform speech-to-text and what it is capable of, how about we go ahead and try it out for ourselves?

In our next posts we will explore the different building blocks and the power of leveraging CLOUD and open source to achieve state-of-the-art results.


Vaibhav Satpathy

AI Enthusiast and Explorer

