Custom NLP: Chapter 3 - Entity Recognition using AutoML

NLP May 26, 2021

In our previous post Custom NLP: Chapter 2 - Data Annotation using Label Studio, we explored how to leverage open-source tools to prepare our data for personalised training.

We ended that article by exporting the annotated data to the local system. In today's article we will see how to transform that exported data, feed it into AutoML on Google Cloud Platform, and use the Google Annotator to help us perform the same annotation task.

It is not necessary to perform the annotation again on GCP; you can transform the data into the format AutoML expects and trigger the training process directly.

Without wasting much time, let's take a look at the steps involved in setting up AutoML Natural Language model training.

What is AutoML?

AutoML is a service provided by Google on Google Cloud Platform which enables developers with limited machine learning expertise to train high-quality models specific to their business needs.

You can train state-of-the-art models in minutes with just a few clicks.

It provides an interactive user interface and extensive developer documentation to automate the training process while delivering a great user experience.

Without further ado, let's get right to it.

Step 1: Transform the data into the expected format

Google provides extensive documentation on the templates expected by its services, helping developers automate the process and reduce the manual effort needed for such mundane tasks.

Today we will be taking a look at how to set up AutoML for the task of Named Entity Recognition.

As per our previous post, the exported data was in JSON format. AutoML NL expects the data to meet the following requirements -

  1. The dataset must be in JSONL file format
  2. The dataset's metadata and its split into TRAIN, TEST and VALIDATION sets must be listed in a CSV file
  3. All the content related to AutoML should be uploaded to Google Cloud Storage

Let's take a look at the JSONL format AutoML NL expects for -

  1. Unannotated Data
    {"content": string}

  2. Annotated Data

    {
        "annotations": [
            {
                "text_extraction": {
                    "text_segment": {"end_offset": number, "start_offset": number}
                },
                "display_name": string
            },
            {
                "text_extraction": {
                    "text_segment": {"end_offset": number, "start_offset": number}
                },
                "display_name": string
            }
        ],
        "text_snippet": {"content": string}
    }

You can find further documentation here. Now that we know the required JSONL format, let's go ahead and use the following code snippet to convert the exported JSON file.

import json

# Path to the JSON file exported from Label Studio
file_path = "<path-of-label-studio-json-file>"

with open(file_path, "r") as read_file:
    data = json.load(read_file)

train_data = []
for sentences in data:
    sentence = sentences.get("ner")
    labels = sentences.get("label")
    # for label in labels:
    #     text = label.get("text")
    #     gt = label.get("labels")[0]
    #     print(f"Text: {text} Ground Truth: {gt}")
    json_line = {"annotations": [], "text_snippet": {"content": sentence}}
    train_data.append(json_line)

# Write one JSON object per line (JSONL)
with open("coa_train.jsonl", "w") as annot_train:
    for sentence in train_data:
        json.dump(sentence, annot_train)
        annot_train.write("\n")

The above code snippet will export a JSONL file as below in your working directory.

{"annotations": [], "text_snippet": {"content": "Avul Pakir Jainulabdeen Abdul Kalam was born on 15 October 1931. He was an Indian aerospace scientist who served as the 11th President of India from 2002 to 2007. He was born and raised in Rameswaram, Tamil Nadu and studied physics and aerospace engineering. He spent the next four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation (DRDO) and Indian Space Research Organisation (ISRO) and was intimately involved in India's civilian space programme and military missile development efforts. He thus came to be known as the Missile Man of India for his work on the development of ballistic missile and launch vehicle technology. He also played a pivotal organisational, technical, and political role in India's Pokhran-II nuclear tests in 1998, the first since the original nuclear test by India in 1974."}}
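If you prefer to carry the Label Studio annotations over instead of re-annotating with Google Annotator, the commented-out label loop above can be extended into AutoML's annotation objects. A minimal sketch, assuming each entry under "label" also carries "start" and "end" character offsets alongside "labels" (as Label Studio's JSON export typically does); the function name and sample item are illustrative:

```python
import json

def to_automl_line(item):
    """Convert one Label Studio item into an AutoML NL entity-extraction line.

    Assumes each entry under "label" has "start", "end" and "labels" keys.
    """
    annotations = []
    for label in item.get("label", []):
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": label["start"],
                    "end_offset": label["end"],
                }
            },
            "display_name": label["labels"][0],
        })
    return {"annotations": annotations,
            "text_snippet": {"content": item.get("ner")}}

# Example with a single PERSON span covering "Abdul Kalam"
item = {
    "ner": "Abdul Kalam was the 11th President of India.",
    "label": [{"start": 0, "end": 11, "labels": ["PERSON"]}],
}
print(json.dumps(to_automl_line(item)))
```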

Once you have the expected JSONL file, we need to create the CSV file to be uploaded to GCS. Each row of the CSV pairs a dataset split (TRAIN, TEST or VALIDATION) with the GCS path of the corresponding JSONL file.
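A minimal sketch of generating that CSV, assuming one JSONL file per split; the bucket and file names here are placeholders to replace with your own GCS paths:

```python
import csv

# Hypothetical GCS paths - substitute your own bucket and JSONL file names
rows = [
    ("TRAIN", "gs://my-automl-bucket/coa_train.jsonl"),
    ("TEST", "gs://my-automl-bucket/coa_test.jsonl"),
    ("VALIDATION", "gs://my-automl-bucket/coa_val.jsonl"),
]

with open("coa_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # one "SPLIT,gs://path" line per JSONL file
```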



Step 2: Create a GCS Bucket and upload the files

There are multiple ways in which you can create and upload to GCS. Let's take a look at how to achieve the task by using the GCP Console.

  1. Go to your console and search for Storage
  2. Click on Create Bucket and fill in the necessary details

  3. Once you have created the bucket, import the JSONL and CSV files into it

Step 3: Setup your AutoML platform

As with any AI problem, we first need to create the dataset, then import the data, and then train the model.

  1. Go to your GCP console
  2. Search for Natural Language and Open AutoML Datasets
  3. Click on Create Dataset

  4. Once you have created the dataset, import all the data from your GCS bucket

Step 4: Patience

Now that we have created the dataset and imported the data, we have to wait until the import finishes.

Once the dataset import is complete, we move on to the next milestone in our journey of Custom NLP - Data Annotation using Google Annotator.

As part of this article, we learnt how to transform data from Label Studio for AutoML, how to set up AutoML Natural Language for importing a dataset, and the various configurations available with AutoML.

I hope this article helps kickstart your journey. In our next post we will cover how to annotate, train and deploy models into production using AutoML. Stay tuned 😁


Vaibhav Satpathy

AI Enthusiast and Explorer
