AI in Audio

Audio Processing May 17, 2021


Sound is among the first things we create as humans when we are born, and the same is true of many other animals and birds. We might not have clear vision at the moment of birth, but the sense of sound is present in most living beings.

Sound has been an integral part of human development, and it is as important as vision, if not more. Human civilisations would not have developed as far as they have without our ability to use sound in the form of speech.

Why is Sound fascinating?

Sound has the power to alter human emotions, and it can evoke anything from sadness to joy. If you sat in a particular place with your eyes closed at a specific time for a few days, and on some other occasion were brought to the same spot blindfolded, you would still recognise that place. Patterns of sound are registered in our brains, and that is how we recognise places, animals or other creatures by their sounds.

Our ability to perceive sound, and the way our brain interprets it, is phenomenal. Most eastern languages give special significance to sound. These languages developed from a deep understanding of each sound a human can produce.

Sanskrit, often regarded as the mother of many Indian languages, is one such language and an epitome of human understanding of, and mastery over, sound.

What is Audio in Digital form?

Digital audio is a representation of sound encoded as a sequence of numeric values (samples).

The audio sequence is stored with two parameters -

  1. Bit depth (typically 16 or 24 bits), which decides how much information is stored per sample
  2. Sample rate, which decides how many samples are captured per second and therefore the highest frequency that can be represented

We can hear frequencies between roughly 20 Hz and 20 kHz. To recreate a sound in a computer, the sample rate must be at least double the highest frequency we want to capture. Hence, CD audio uses a 44.1 kHz sample rate, which can represent frequencies up to 22.05 kHz.
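The arithmetic behind the CD figures can be checked in a few lines. This is a minimal sketch that assumes 16-bit stereo audio; the function names are illustrative and not part of any library.

```python
def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent without aliasing."""
    return sample_rate_hz / 2


def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bit_depth * channels


cd_sample_rate = 44_100  # samples per second
cd_bit_depth = 16        # bits per sample (assumed)
cd_channels = 2          # stereo (assumed)

print(nyquist_frequency(cd_sample_rate))                         # 22050.0
print(pcm_bit_rate(cd_sample_rate, cd_bit_depth, cd_channels))   # 1411200
```

Halving the sample rate gives the highest representable frequency, which is why 44.1 kHz comfortably covers the range of human hearing.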

Audio Processing Basics

ADC (Analogue to Digital Converter) & DAC (Digital to Analogue Converter)

Converting audio is a five-step process: the first three steps cover audio ingestion and conversion to digital format, and the remaining two deal with the conversion from digital back to analogue.
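The individual steps are not listed here, so the sketch below does not try to reproduce them. It only illustrates two standard stages of analogue-to-digital conversion, sampling and quantisation, followed by the reverse mapping. The helper names and the 440 Hz test tone are assumptions made for the example.

```python
import math
from typing import List


def sample_sine(freq_hz: float, sample_rate_hz: int, duration_s: float) -> List[float]:
    """Sample a sine wave (amplitude in [-1, 1]) at regular time steps."""
    n_samples = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]


def quantise(samples: List[float], bit_depth: int) -> List[int]:
    """Map each sample in [-1, 1] to the nearest signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    return [round(s * max_level) for s in samples]


def dequantise(levels: List[int], bit_depth: int) -> List[float]:
    """Map integer levels back to floats in [-1, 1]."""
    max_level = 2 ** (bit_depth - 1) - 1
    return [level / max_level for level in levels]


analogue = sample_sine(freq_hz=440.0, sample_rate_hz=8_000, duration_s=0.01)
digital = quantise(analogue, bit_depth=16)
restored = dequantise(digital, bit_depth=16)

# The round trip loses at most half a quantisation step per sample
max_error = max(abs(a - r) for a, r in zip(analogue, restored))
print(len(digital), max_error)
```

A higher bit depth gives more integer levels, so the rounding error introduced by quantisation shrinks.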

Intuition for Speech Recognition

Speech recognition is one of the use cases of AI for audio, and one of the easier ones to visualise.

Let's look at an example in which we want to classify two sounds, the spoken words "hello" and "world".

Since these are two different words, each one will obviously sound unique to our ears. Now, we will visualise both words with the help of a Python script.

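import numpy as np
import librosa
# waveshow replaces waveplot in recent librosa releases
from librosa.display import waveshow, specshow

# Load each clip as a waveform array together with its sample rate
y1, sr1 = librosa.load("hello.ogg")
y2, sr2 = librosa.load("world.ogg")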

The code snippet above loads the hello.ogg and world.ogg audio files using the Librosa library.

Waveform of 'hello.ogg'
Waveform of 'world.ogg'

Different kinds of sound have different waveforms. However, a waveform alone still won't let us uniquely identify these sounds. For that, we will use what is called a feature extractor or feature transformer.

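def generate_chromagram(y: np.ndarray, sr: int = 22050):
    # Separate harmonics and percussives into two waveforms
    y_harmonic, y_percussive = librosa.effects.hpss(y)

    # Compute chroma features from the harmonic signal.
    # The listing was cut off after `y=y_harmonic,`; the sr argument
    # (librosa.load's default rate) and the plot call are assumed completions.
    chromagram = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)

    # Plot the 12 pitch classes against time
    specshow(chromagram, sr=sr, x_axis="time", y_axis="chroma")
    return chromagram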

The code snippet above generates a chromagram and plots it.
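To build some intuition for what a chroma feature captures, here is a deliberately simplified sketch using only NumPy. It is not how Librosa computes its chromagram (chroma_cqt is based on a constant-Q transform and works frame by frame); it just folds the FFT magnitude spectrum of a whole clip into the 12 pitch classes. The function name simple_chroma and the synthetic 440 Hz test tone are assumptions made for the example.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def simple_chroma(y: np.ndarray, sr: int) -> np.ndarray:
    """Fold the magnitude spectrum of a whole clip into 12 pitch-class bins."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)

    # Ignore DC and very low bins, where pitch is not meaningful
    audible = freqs >= 27.5
    # MIDI note number of each bin (A4 = 440 Hz = note 69), then modulo 12
    midi = 69 + 12 * np.log2(freqs[audible] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12

    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class, spectrum[audible])
    return chroma / chroma.max()  # normalise so the strongest class is 1.0


sr = 22_050
t = np.arange(sr) / sr                     # one second of sample times
tone_a4 = np.sin(2 * np.pi * 440.0 * t)    # pure A4 tone
chroma = simple_chroma(tone_a4, sr)
print(PITCH_CLASSES[int(np.argmax(chroma))])  # A
```

Because every octave of the same note lands in the same bin, the resulting 12-value profile describes which notes dominate a sound regardless of how high or low they are played.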

Do not worry if you aren't familiar with the concept or the framework. We will cover them in our future posts in more detail.

In the example below, we have used the chroma feature, or chromagram plot, to show unique features that can help distinguish between sounds.

Chromagram of 'hello.ogg'
Chromagram of 'world.ogg'

We can see that each sound file has a unique imprint. However, we won't try to distinguish these sounds visually. AI algorithms can do that for us and handle tasks like audio classification, similarity search, etc.
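As a rough illustration of how a similarity search over such features might work, the sketch below compares feature vectors with cosine similarity and picks the closest stored template. Real systems typically use learned models rather than this nearest-template approach, and the four-dimensional vectors are made up purely for the example.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(features: List[float], templates: Dict[str, List[float]]) -> str:
    """Return the label of the stored template most similar to `features`."""
    return max(templates, key=lambda label: cosine_similarity(features, templates[label]))


# Made-up 4-dimensional feature vectors, purely for illustration
templates = {
    "hello": [0.9, 0.1, 0.3, 0.0],
    "world": [0.1, 0.8, 0.0, 0.4],
}
unknown = [0.8, 0.2, 0.25, 0.05]

print(classify(unknown, templates))  # hello
```

In practice the vectors would come from a feature extractor such as the chromagram, summarised over time so that clips of different lengths can be compared.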

Some use cases for AI in Audio -

  1. Speech Recognition: Siri, Alexa, Cortana and Google Assistant are a few of the mainstream digital assistants that leverage speech recognition as the primary form of interaction with their users.
  2. Music Generation: Companies use AI to create, enhance, and complement content for the music industry.
  3. Music Streaming: Machine learning and deep learning techniques enable streaming platforms to recommend personalised content based on data from user activity.
  4. Speech to Text: Automates note-taking during an interview or a meeting, and makes transcription services less labour-intensive.
  5. Speech Generation: In 2018, Google unveiled its state-of-the-art speech generation tech, and the AI was able to book a haircut appointment. The person on the phone couldn't tell whether it was a human or an AI making the call!
  6. Audio Classification: Some apps can identify the artist from just a few seconds of an audio clip.

This list is just a handful of use cases, and there are many more out there. In this series, we will cover how to get hands-on with audio data and start building things on our own.

Closing thoughts -

We have seen how sound is an essential part of being human, and how AI gives us the ability to mimic and produce human-like behaviour in machines. AI empowers us to help solve the challenges of our times. So far, we have covered the basics of how audio processing works and how to get started with audio files using the Python library Librosa. We will continue to build on this and look at some in-depth concepts in our upcoming posts.

Nikhil Akki

Full Stack AI Tinkerer
