AWS Textract - Document Processing

AWS Oct 6, 2021
OCR - Optical Character Recognition

A term that has been floating around in the era of digital transformation.

But what do we mean by OCR?

Simply put, it is the technology that moves written information from the physical to the digital realm.

Over the centuries humanity has accumulated incomprehensible volumes of knowledge. Keeping track of every bit of that information was practically impossible until now.

Thanks to OCR we are now able to read and digitalise data at blazing fast speeds. On top of that, transforming unstructured information into a structured format and storing it has made extraction and filtering from such a massive knowledge base a walk in the park.

So if OCR is the solution, then what is Textract?

What is Textract?

Amazon Textract is a machine learning (ML) service that uses OCR to automatically extract text, handwriting, and data from scanned documents such as PDFs.

Well, that's the standard description. But what is it really?

It is a document processing engine that can not only extract text using OCR but also restructure the extracted information into form (key-value) and tabular data.
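To make "form data" a little more concrete, here is a minimal sketch of how key-value pairs could be read out of a Textract-style response. It assumes the block layout returned by the AnalyzeDocument API with the FORMS feature (`KEY_VALUE_SET` blocks linked to `WORD` blocks through `Relationships`). The helper names `form_fields` and `_text_of` and the `sample` data are made up for illustration; the sample is hand-written, not real Textract output.

```python
from typing import Dict, List


def _text_of(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a block into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def form_fields(response: dict) -> Dict[str, str]:
    """Map each detected form key to the text of its value."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    fields: Dict[str, str] = {}
    for block in response["Blocks"]:
        # Only KEY blocks start a pair; VALUE blocks are reached through them.
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_parts: List[str] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(by_id[value_id], by_id))
        fields[_text_of(block, by_id)] = " ".join(value_parts)
    return fields


# Hand-written sample in the shape of an AnalyzeDocument response (illustrative only).
sample = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Vendor"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Acme"},
    ]
}

print(form_fields(sample))  # {'Vendor Name:': 'Acme'}
```

A real response also carries confidence scores and geometry for every block, which this sketch ignores.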

One of its major highlights is that, besides the standard form parser, it has specialised models for extracting data from invoices and receipts.

Why do we need it?

Many organisations process large volumes of documents every day: paperwork from vendors and contractors, labour sheets, time summaries and a lot more.

It becomes extremely tedious to manually organise (classify) the documents and then extract and filter the relevant information from them.

In such cases, products such as Amazon Textract come in really handy to accelerate the extraction process.

For example, parsing a table in a time summary, which might have taken around 5-10 minutes by hand, now takes barely a few seconds.
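As a rough illustration of that table parsing, the sketch below rebuilds rows from a Textract-style response. It assumes the AnalyzeDocument block layout for the TABLES feature, where a `TABLE` block points to `CELL` blocks that carry 1-based `RowIndex` and `ColumnIndex` values. The function name `table_rows` and the `time_sheet` sample are hypothetical, and the sample is hand-written rather than real output.

```python
from typing import Dict, List, Tuple


def table_rows(response: dict) -> List[List[str]]:
    """Rebuild the first TABLE block as a list of rows of cell text."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    table = next((b for b in response["Blocks"] if b["BlockType"] == "TABLE"), None)
    if table is None:
        return []

    cells: Dict[Tuple[int, int], str] = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # A cell's text is the concatenation of its WORD children.
            words = [
                by_id[word_id]["Text"]
                for r in cell.get("Relationships", [])
                if r["Type"] == "CHILD"
                for word_id in r["Ids"]
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Empty cells have no WORD children, so they default to "".
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


# Hand-written 2x2 time summary in the shape of an AnalyzeDocument response.
time_sheet = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
        {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Hours"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Asha"},
        {"Id": "w4", "BlockType": "WORD", "Text": "8"},
    ]
}

print(table_rows(time_sheet))  # [['Employee', 'Hours'], ['Asha', '8']]
```

The resulting list of rows can be handed straight to a CSV writer or a spreadsheet library, which is where most of the manual time saving comes from.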

Also, because the tool is built on machine learning models, it tends to perform well across a wide range of documents, and it can be integrated with RPA (Robotic Process Automation) bots to streamline an end-to-end pipeline.

How do we use it?

Like most AWS services, it can be accessed in two main ways.

  1. Using Console
  2. Using Client SDK
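For orientation only, the second route could look roughly like the sketch below, written with the Python SDK (boto3) and its `detect_document_text` call. It assumes AWS credentials and a region are already configured; the function name `detect_lines` and the optional `client` parameter are illustrative choices, added so the function can be tried with a stand-in client before calling the real service.

```python
from typing import Any, List, Optional


def detect_lines(document_bytes: bytes, client: Optional[Any] = None) -> List[str]:
    """Send one document image to Textract and return the text of each detected line."""
    if client is None:
        # Imported lazily so the function can be exercised without boto3 installed.
        import boto3
        client = boto3.client("textract")  # assumes credentials and region are configured

    response = client.detect_document_text(Document={"Bytes": document_bytes})
    # The response is a flat list of blocks (PAGE, LINE, WORD); keep only the lines.
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


# Example usage (requires an AWS account; "scan.png" is a placeholder file name):
# with open("scan.png", "rb") as fh:
#     for line in detect_lines(fh.read()):
#         print(line)
```

Calls made this way are billed like any other Textract request, so the same Free Tier limits apply.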

In today's article we will cover using Textract through the console.

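Prerequisite: you should have an AWS account. Textract is included in the AWS Free Tier, so trying it out on a small number of documents should not cost anything (the allowance is counted in pages and is limited in time, so check the current Textract pricing page for the exact limits).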

Once you log in to your AWS console, type Textract in the search bar and follow the steps below.

Conclusion

Congratulations, you have successfully used Textract for document extraction.

As you can see, the whole process is intuitive and the experience is seamless. Both the OCR accuracy and the transformation of the unstructured uploaded document into structured data work remarkably well.

In our next post we will explore how to perform the same document analysis using client libraries and how to leverage the specialised models.

STAY TUNED 😁

Vaibhav Satpathy

AI Enthusiast and Explorer
