Speech to Text: Chapter 4 - AWS Transcribe

Audio Processing Aug 23, 2021

Being cloud agnostic matters in today's world. Knowing the services offered by the various cloud platforms, and being able to implement them quickly, is a superpower in the 21st century.

In our previous blogs we covered:

  1. Speech to Text: Chapter 1 - Introduction
  2. Speech to Text: Chapter 2 - Google Cloud Platform
  3. Speech to Text: Chapter 3 - Open Source

In today's article we are going to cover one of the leading cloud platforms in the industry: Amazon Web Services (AWS).

Speech to Text is one of the most fundamental problems in the domain of audio processing. Speech is the basic means of communication between two people, and catering to the vast variety of languages and dialects spoken by a population of about 8 billion, while still producing satisfactory results, is a huge hurdle to overcome.

AWS Transcribe, a service offered by Amazon Web Services, helps us overcome this hurdle to a great extent.

Without beating around the bush, let's go ahead and explore how to perform Speech to Text using AWS Transcribe.

Step 0: Set up your Account

To set up your account for testing AWS services on your local system, follow the detailed documentation mentioned here.
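Account setup usually ends with a shared credentials file on your machine (by default `~/.aws/credentials`). As a quick sanity check, here is a minimal sketch that lists the profile names found in such a file. It assumes the standard INI layout; the helper name `list_profiles` and the sample keys are made up for illustration.

```python
import configparser


def list_profiles(credentials_text):
    """Return the profile names found in AWS-style credentials text."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


# Made-up sample content; real keys must never be committed or shared.
sample_credentials = """
[default]
aws_access_key_id = EXAMPLE_KEY_ID
aws_secret_access_key = EXAMPLE_SECRET

[work]
aws_access_key_id = ANOTHER_KEY_ID
aws_secret_access_key = ANOTHER_SECRET
"""

print(list_profiles(sample_credentials))
```

If the list comes back empty for your own file, the setup step has most likely not been completed yet.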

Step 1: Import the necessary packages

import boto3
import time

Boto3 is the Python SDK offered by AWS. It covers pretty much all the AWS services and the features you need to access programmatically.

Detailed documentation on Boto3 can be found here.

Step 2: Record an Audio

Before we proceed further, we need to understand the file formats AWS supports for transcription.

AWS offers 2 kinds of transcription service:

  1. Batch Transcription
  2. Streaming Transcription

In case of Batch Transcription the supported file formats are -

  1. FLAC
  2. MP3
  3. MP4
  4. Ogg
  5. WebM
  6. AMR
  7. WAV

It is recommended that you use a lossless format such as FLAC or WAV, and a sample rate of 8000 Hz for telephone audio.

Also, the recorded audio file must be no longer than 4 hours and no larger than 2 GB.
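The format list and limits above can be checked locally before you upload anything. The sketch below is one possible way to do it: the helper names `media_format_for` and `within_batch_limits` are hypothetical, and reading "2 GB" as 2 * 1024^3 bytes is an assumption.

```python
from pathlib import PurePosixPath

# Formats and limits as listed in this article for batch transcription.
BATCH_FORMATS = {"flac", "mp3", "mp4", "ogg", "webm", "amr", "wav"}
MAX_DURATION_SECONDS = 4 * 60 * 60   # 4 hours
MAX_SIZE_BYTES = 2 * 1024 ** 3       # 2 GB, read here as 2 GiB (assumption)


def media_format_for(filename):
    """Return the lowercase extension if it is a supported batch format."""
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    if extension not in BATCH_FORMATS:
        raise ValueError(f"Unsupported batch format: {extension!r}")
    return extension


def within_batch_limits(size_bytes, duration_seconds):
    """Return True if the file respects the size and duration limits."""
    return (size_bytes <= MAX_SIZE_BYTES
            and duration_seconds <= MAX_DURATION_SECONDS)


print(media_format_for("interview.FLAC"))    # flac
print(within_batch_limits(50_000_000, 900))  # True
```

The value returned by `media_format_for` is also a reasonable candidate for the `MediaFormat` argument when the job is created.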

For Streaming Transcription the supported file formats are -

  1. FLAC
  2. Opus

The recommendations mentioned above apply to these formats as well.

Step 3: Create a Transcription Job

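transcribe = boto3.client('transcribe')

# Placeholder: job names must be unique in your account and cannot contain spaces
job_name = "job-name"
# TODO for the developer to provide the URI for the Recorded Audio
job_uri = "s3://DOC-EXAMPLE-BUCKET1/key-prefix/file.file-extension"

# Start the batch job. MediaFormat and LanguageCode are example values
# (assumptions); set them to match your recording.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='wav',
    LanguageCode='en-US'
)

# Poll every 5 seconds until the job reaches a terminal state
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)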

If you don't have an audio file in cloud storage (Amazon S3) yet, you can upload one to your S3 bucket using the AWS Console.
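If you would rather upload from code than from the Console, a sketch along these lines may help. It takes an S3 client as a parameter and returns the `s3://` URI that Step 3 expects; the helper name `upload_audio`, the local file name, and the key are hypothetical.

```python
def upload_audio(s3_client, local_path, bucket, key):
    """Upload a local audio file and return its S3 URI."""
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client method.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (needs AWS credentials and an existing bucket):
#   import boto3
#   job_uri = upload_audio(boto3.client("s3"), "recording.wav",
#                          "DOC-EXAMPLE-BUCKET1", "key-prefix/recording.wav")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to try out with a stand-in object before touching a real bucket.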

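The polling loop in Step 3 keeps asking for the job status until the job finishes. A variant with a timeout, sketched below, avoids waiting forever if something goes wrong. The function name `wait_for_job` and its default values are assumptions; the only service call used is `get_transcription_job`, as in Step 3.

```python
import time


def wait_for_job(transcribe_client, job_name, poll_seconds=5,
                 timeout_seconds=600, sleep=time.sleep):
    """Poll a transcription job until it finishes or the timeout elapses."""
    waited = 0
    while True:
        response = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job {job_name!r} still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (needs AWS credentials and a started job):
#   import boto3
#   job = wait_for_job(boto3.client("transcribe"), "job-name")
#   print(job["TranscriptionJobStatus"])
```

For a completed job, the returned dictionary should also carry the location of the result file under `job["Transcript"]["TranscriptFileUri"]`.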
Step 4: Enjoy your Transcription

Below is an abridged sample of the result produced by a Transcription Job.

{
   "jobName": "job ID",
   "accountId": "account ID",
   "results": {
      "transcripts": [
         { "transcript": " that's no answer" }
      ]
   }
}

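To pull just the text out of a result like the sample above, something like the following sketch can be used. It assumes the result JSON nests the text under `results` → `transcripts` → `transcript`; the helper name `transcript_text` is hypothetical and the sample string is the abridged response from this article.

```python
import json

sample_result = """
{
  "jobName": "job ID",
  "accountId": "account ID",
  "results": {
    "transcripts": [
      {"transcript": " that's no answer"}
    ]
  }
}
"""


def transcript_text(result_json):
    """Join the transcript entries of a result document into one string."""
    document = json.loads(result_json)
    parts = [entry["transcript"].strip()
             for entry in document["results"]["transcripts"]]
    return " ".join(parts)


print(transcript_text(sample_result))  # that's no answer
```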

Congratulations! If you have followed the steps above, you should have completed your first transcription using AWS Transcribe.

This article only takes you through basic transcription. In future articles we will cover additional concepts such as -

  1. Speaker Diarization - Identifying the Speakers
  2. Transcribing multi-channel audio
  3. Filtering Unwanted words
  4. Redaction of Personal Information
  5. Adding Custom Vocabulary and more ...

STAY TUNED for more interesting content on Audio Processing. 😁


Vaibhav Satpathy

AI Enthusiast and Explorer
