AWS Textract - Developers View

AWS Oct 8, 2021

In our previous post we touched the tip of the iceberg for Document Processing using Textract.

But all we learnt was the What and Why of it.

In today's article we will dive deeper and understand How to leverage Textract and incorporate it for Automation using Client Libraries.

Without wasting much time, let's get right to it.

Step 0: Set up your AWS Account

If you already have an account and have set up your workspace, feel free to skip this step. First timers, please follow the link below to set up your account.

AWS CLI and SDK - Setup for Devs

Step 1: Install Client Library

pip3 install boto3

Step 2: Import the necessary packages

import boto3

client = boto3.client("textract")

For this article we will be using Boto3, the Python SDK offered by AWS for developers.

Detailed documentation for the Textract client can be found in the official Boto3 documentation.

Step 3: Read your Document

Before we get deeper into scripting out the solution, let's understand a couple of things first.

There are 2 types of calls that can be made to Textract for Document Extraction -

  1. Synchronous
  2. Asynchronous

In synchronous calls, the documents need to be in JPEG or PNG format. They can be passed either as raw bytes or as a reference to an object in S3 (as in Step 5), and documents passed as bytes are limited to 5 MB.
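The synchronous limits can be checked locally before any call is made. The following is a minimal sketch and not part of the Textract SDK: the helper name check_sync_document and both constants are hypothetical, and the limits are simply the ones stated in this article, so verify them against the current Textract documentation.

```python
import os

# Limits as described in this article for synchronous calls (assumptions,
# not values read from the SDK).
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MAX_SYNC_BYTES = 5 * 1024 * 1024  # 5 MB


def check_sync_document(path: str) -> None:
    """Raise ValueError if the file does not fit the synchronous limits."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format for a synchronous call: {extension!r}")
    size = os.path.getsize(path)
    if size > MAX_SYNC_BYTES:
        raise ValueError(f"File is {size} bytes, above the {MAX_SYNC_BYTES} byte limit")
```

Calling check_sync_document(image_path) before reading the file gives a clear local error instead of a rejected request.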

Asynchronous calls, on the other hand, accept documents in JPEG, PNG or PDF format. The one requirement is that the documents must be stored in Cloud Storage (S3).

Because the call is asynchronous, it returns a Job ID, which can then be used to poll for the status of the job until it completes.

It is recommended that we use Async calls for large documents or batch processing.
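The start-then-poll flow described above can be sketched as follows. This is an illustration under assumptions rather than a production implementation: the function name wait_for_analysis and its poll_seconds and max_polls parameters are hypothetical, while start_document_analysis and get_document_analysis are the Boto3 Textract calls for asynchronous analysis. The client is passed in as an argument, so the one created in Step 2 can be reused.

```python
import time


def wait_for_analysis(client, bucket: str, name: str,
                      poll_seconds: float = 5.0, max_polls: int = 60) -> list:
    """Start an async analysis job on an S3 object and return all result blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")

    # Anything other than SUCCEEDED is treated as a failure in this sketch.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Job {job_id} ended with status {result['JobStatus']}")

    # Large results are paginated; follow NextToken to collect every block.
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

Fixed-interval polling keeps the example short. For heavier workloads a completion notification is usually a better fit than a polling loop.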

Now let's get right to it...

def read_image_file(path: str):
    with open(path, "rb") as image_file:
        content = image_file.read()
    return {"Bytes": content}

Step 4: Call your Client Library

Now all that is left is to read the image and call the Boto3 SDK to get the response.

# Developer to pass their Document's Image
image_path = ""

image_content = read_image_file(path=image_path)

response = client.analyze_document(
    Document=image_content,
    FeatureTypes=["FORMS"],
)

# In case you want both Tables and Forms in response
# response = client.analyze_document(
#     Document=image_content,
#     FeatureTypes=["FORMS", "TABLES"],
# )

# Get the output Response
print(response)

The example above covers the case where you want to use the standard model offered by Textract for Document Extraction and the image is on your local system.
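The printed response is a large dictionary whose Blocks list links form keys to their values through Relationships entries. As a rough sketch of how that structure can be walked, the helpers below turn a FORMS response into a plain key/value dictionary. The function names get_text and extract_key_values are hypothetical, and selection elements such as checkboxes are ignored for brevity.

```python
def get_text(block: dict, blocks_by_id: dict) -> str:
    """Join the text of the WORD children of a block."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response: dict) -> dict:
    """Map each detected form key to the text of its value."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key_text = get_text(block, blocks_by_id)
        value_text = ""
        # A KEY block points at its VALUE block through a VALUE relationship.
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(blocks_by_id[value_id], blocks_by_id)
                    for value_id in relationship["Ids"]
                )
        pairs[key_text] = value_text
    return pairs
```

With the response from Step 4, extract_key_values(response) returns the form fields as a dictionary that is easier to inspect than the raw output.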

Step 5: Specialised Models

In this step we will call the specialised model for invoice and receipt documents to perform enhanced extraction.

We will also use a sample document that has been uploaded to S3 Cloud Storage, rather than a local file.

# User to provide their own Bucket Name and File Path
response = client.analyze_expense(
    Document={
        "S3Object": {
            "Bucket": "documents-dataset",
            "Name": "invoice.jpg",
        }
    }
)

# Get the output Response
print(response)

As you can see, in the code above we called analyze_expense instead of analyze_document, thereby making a call to the specialised model for invoice and receipt documents.
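The analyze_expense response is organised differently from the analyze_document one: results sit under ExpenseDocuments, and each document carries a list of SummaryFields (vendor name, total and so on). The sketch below, with a hypothetical helper name extract_summary_fields, flattens those fields into tuples. Line items, which live under LineItemGroups, are left out to keep it short.

```python
def extract_summary_fields(response: dict) -> list:
    """Return (type, label, value) tuples for every summary field."""
    fields = []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            fields.append((
                # Normalised field type assigned by the model, e.g. "TOTAL".
                field.get("Type", {}).get("Text", ""),
                # Label as printed on the document (may be missing).
                field.get("LabelDetection", {}).get("Text", ""),
                # Detected value for the field.
                field.get("ValueDetection", {}).get("Text", ""),
            ))
    return fields
```

Fields that the model could not label come back with an empty string in the label position, so none of them are dropped.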

Step 6: Explore

If you are already comfortable with how Textract works and feel that synchronous calls are becoming a bottleneck in your process, below is a link to an implementation of async calls for Document Extraction.

Note - At the time of writing, there is no support for specialised models in async mode.
Processing Documents with Asynchronous Operations - Amazon Textract
Overview of Amazon Textract analysis.

Conclusion

Congratulations! You have successfully implemented a scalable Document Processing engine with specialised models for Invoices and Receipts.

In future posts we will cover a more end-to-end approach, showing how cloud-native services can be integrated with Textract to make it enterprise grade.

STAY TUNED 😁


Vaibhav Satpathy

AI Enthusiast and Explorer
