NLP Chapter 4: AWS Comprehend

NLP Sep 1, 2021

In our previous posts, we explored some of the platforms that can be leveraged to perform Natural Language Processing (NLP).

Today's article is about one of those leading cloud platforms - Amazon Web Services (AWS).

Rather than walking you through what NLP is all over again, how about you take a brief look at the article mentioned below?

Natural Language Processing: Chapter 1
Who doesn’t feel comfortable in their own skin? The truth bespoken everyone likes to interact and converse in a language that they are well versed with. For animals it’s phonetics, whereas for us humans it’s a slightly evolved version of the same. Over the centuries, the human language has seen

Now without wasting much of our precious time, how about we take a look at the various components of NLP that we will be covering in today's Chapter -

  1. Named Entity Recognition (NER)
  2. Key Phrase Detection
  3. Personally Identifiable Information (PII) Detection
  4. Sentiment Analysis
  5. Syntactic Information Detection

If you are not familiar with the above terms, don't worry, as we will be diving deeper into each one of them over the course of this article.

Step 0: Set up your Account

If you already have an account and have set up your workspace, feel free to skip this step. First-timers, please follow the link below to set up your account.

AWS CLI and SDK - Setup for Devs
Before one embarks on a Journey to explore the plethora of services, heading towards innovation, the biggest obstacle for any developer is setting up their system. Sometimes going through vast expanse of documentation can prove to be extremely exhaustive. This article aims at bringing together all t…

Step 1: Install the Client Library

For this article we will be using Boto3, the Python SDK offered by AWS for developers.

You can find its detailed documentation right here.

pip3 install boto3

Step 2: Import the necessary Packages

import boto3

client = boto3.client("comprehend")

Step 3: Open your Code editor and get a cup of Coffee

text = "It is raining today in Seattle"

Test 1: Named Entity Recognition

As the name suggests, it identifies the entities within a particular sentence. So what are entities?

Entities are words or phrases in a sentence that refer to a specific thing, such as a person, a place, or a date. Each one is tagged with a type based on a general classification.

Which entity types are covered?

AWS covers a wide range of entity types when it comes to NLP. Some of the entity types under its Comprehend service are -

| Type | Description |
| --- | --- |
| COMMERCIAL_ITEM | A branded product |
| DATE | A full date (for example, 11/25/2017), day (Tuesday), month (May), or time (8:30 a.m.) |
| EVENT | An event, such as a festival, concert, election, etc. |
| LOCATION | A specific location, such as a country, city, lake, building, etc. |
| ORGANIZATION | Large organizations, such as a government, company, religion, sports team, etc. |
| OTHER | Entities that don't fit into any of the other entity categories |
| PERSON | Individuals, groups of people, nicknames, fictional characters |
| QUANTITY | A quantified amount, such as currency, percentages, numbers, bytes, etc. |
| TITLE | An official name given to any creation or creative work, such as movies, books, songs, etc. |

def detect_entities(text: str):
    # Ask Comprehend for the named entities in an English text.
    response = client.detect_entities(Text=text, LanguageCode='en')

    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    for entity in response.get("Entities"):
        print(f"text: {entity.get('Text')}")
        print(f"confidence_score: {entity.get('Score')}")
        print(f"entity_type: {entity.get('Type')}")
    return response
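
Once the response comes back, it is often handy to organise the entities rather than just print them. The snippet below is a minimal sketch of that idea: `group_entities_by_type` and the `min_score` threshold are hypothetical names introduced here, and `sample_response` is a hand-written stand-in shaped like a `detect_entities` response, with invented scores, so the code runs without an AWS call.

```python
from collections import defaultdict


def group_entities_by_type(response: dict, min_score: float = 0.8) -> dict:
    """Group entity texts by their type, skipping low-confidence detections."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity.get("Score", 0.0) >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Hand-written sample in the shape of a detect_entities response.
# The scores are invented for illustration and are NOT real Comprehend output.
sample_response = {
    "Entities": [
        {"Text": "today", "Type": "DATE", "Score": 0.97, "BeginOffset": 14, "EndOffset": 19},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.99, "BeginOffset": 23, "EndOffset": 30},
        {"Text": "It", "Type": "OTHER", "Score": 0.41, "BeginOffset": 0, "EndOffset": 2},
    ]
}

print(group_entities_by_type(sample_response))
# → {'DATE': ['today'], 'LOCATION': ['Seattle']}
```

With a live client, you would pass the dictionary returned by `detect_entities(text)` instead of the sample. The right threshold depends on your use case, so treat 0.8 as a placeholder.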

Test 2: Detect Key Phrases

A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it.

Usually, these are the words that are stressed in a statement to highlight certain features of the subject, object, or verb involved in the sentence.

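def detect_key_phrases(text: str):
    # Named detect_key_phrases so it does not overwrite detect_entities from Test 1.
    response = client.detect_key_phrases(Text=text, LanguageCode='en')

    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    for phrase in response.get("KeyPhrases"):
        print(f"text: {phrase.get('Text')}")
        print(f"confidence_score: {phrase.get('Score')}")
    return response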
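
Since every key phrase comes with a confidence score, one simple follow-up is to keep only the strongest few. The snippet below is a small sketch under that assumption: `top_key_phrases` is a hypothetical helper, and `sample_response` is a hand-written stand-in shaped like a `detect_key_phrases` response with invented scores.

```python
def top_key_phrases(response: dict, limit: int = 3) -> list:
    """Return up to `limit` key phrase texts, highest confidence first."""
    phrases = sorted(
        response.get("KeyPhrases", []),
        key=lambda phrase: phrase.get("Score", 0.0),
        reverse=True,
    )
    return [phrase["Text"] for phrase in phrases[:limit]]


# Hand-written sample in the shape of a detect_key_phrases response.
# The scores are invented for illustration and are NOT real Comprehend output.
sample_response = {
    "KeyPhrases": [
        {"Text": "today", "Score": 0.89, "BeginOffset": 14, "EndOffset": 19},
        {"Text": "Seattle", "Score": 0.95, "BeginOffset": 23, "EndOffset": 30},
    ]
}

print(top_key_phrases(sample_response, limit=1))
# → ['Seattle']
```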

Test 3: Detect Personally Identifiable Information (PII)

PII plays a significant role when you are dealing with enterprise-grade solutions.

Securing an individual's personal information is critical: it is vital to that individual's privacy and, if exposed, it can be exploited by an attacker.

AWS covers a large set of entity types under its PII banner. Some of the entity types detected by Comprehend are -

| PII entity type | Description |
| --- | --- |
| ADDRESS | A physical address, such as "100 Main Street, Anytown, USA" or "Suite #12, Building 123". An address can include a street, building, location, city, state, country, county, zip, precinct, neighborhood, and more. |
| AGE | An individual's age, including the quantity and unit of time. For example, in the phrase "I am 40 years old," Amazon Comprehend recognizes "40 years" as an age. |
| AWS_ACCESS_KEY | A unique identifier that's associated with a secret access key; the access key ID and secret access key are used together to sign programmatic AWS requests cryptographically. |
| AWS_SECRET_KEY | A unique identifier that's associated with an access key; the access key ID and secret access key are used together to sign programmatic AWS requests cryptographically. |
| BANK_ACCOUNT_NUMBER | A US bank account number. These are typically between 10 - 12 digits long, but Amazon Comprehend also recognizes bank account numbers when only the last 4 digits are present. |
| BANK_ROUTING | A US bank account routing number. These are typically 9 digits long, but Amazon Comprehend also recognizes routing numbers when only the last 4 digits are present. |
| CREDIT_DEBIT_CVV | A 3-digit card verification code (CVV) that is present on VISA, MasterCard, and Discover credit and debit cards. In American Express credit or debit cards, it is a 4-digit numeric code. |
| CREDIT_DEBIT_EXPIRY | The expiration date for a credit or debit card. This number is usually 4 digits long and formatted as month/year or MM/YY. For example, Amazon Comprehend can recognize expiration dates such as 01/21, 01/2021, and Jan 2021. |
| CREDIT_DEBIT_NUMBER | The number for a credit or debit card. These numbers can vary from 13 to 16 digits in length, but Amazon Comprehend also recognizes credit or debit card numbers when only the last 4 digits are present. |
| DATE_TIME | A date can include a year, month, day, day of week, or time of day. For example, Amazon Comprehend recognizes "January 19, 2020" or "11 am" as dates. Amazon Comprehend will recognize partial dates, date ranges, and date intervals. It will also recognize decades, such as "the 1990s". |
| DRIVER_ID | The number assigned to a driver's license, which is an official document permitting an individual to operate one or more motorized vehicles on a public road. A driver's license number consists of alphanumeric characters. |
| EMAIL | An email address, such as marymajor@email.com. |
| IP_ADDRESS | An IPv4 address, such as 198.51.100.0. |
| MAC_ADDRESS | A media access control (MAC) address is a unique identifier assigned to a network interface controller (NIC). |
| NAME | An individual's name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the "John Doe Organization" as an organization, and it recognizes "Jane Doe Street" as an address. |
| PASSPORT_NUMBER | A US passport number. Passport numbers range from 6 - 9 alphanumeric characters. |
| PASSWORD | An alphanumeric string that is used as a password, such as "*very20special#pass*". |
| PHONE | A phone number. This entity type also includes fax and pager numbers. |
| PIN | A 4-digit personal identification number (PIN) that allows someone to access their bank account information. |
| SSN | A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present. |
| URL | A web address, such as www.example.com. |
| USERNAME | A user name that identifies an account, such as a login name, screen name, nick name, or handle. |

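def detect_pii_entities(text: str):
    # Ask Comprehend for the PII entities in an English text.
    response = client.detect_pii_entities(Text=text, LanguageCode='en')

    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    for entity in response.get("Entities"):
        print(f"confidence_score: {entity.get('Score')}")
        print(f"entity_type: {entity.get('Type')}")
    return response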
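
Each PII entity in the response also carries a `BeginOffset` and an `EndOffset`, which makes it possible to mask the sensitive spans in the original text. The snippet below is one possible sketch of that: `redact_pii` and its `min_score` threshold are hypothetical, the sample entities are hand-written with invented scores, and it assumes the offsets index directly into the Python string.

```python
def redact_pii(text: str, entities: list, min_score: float = 0.5) -> str:
    """Replace each detected PII span with its entity type in square brackets."""
    redacted = text
    # Work from the end of the text backwards so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity.get("Score", 0.0) < min_score:
            continue
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + f"[{entity['Type']}]" + redacted[end:]
    return redacted


# Hand-written sample in the shape of the "Entities" list from detect_pii_entities.
# The scores are invented for illustration and are NOT real Comprehend output.
sample_text = "Contact Jane at jane@example.com"
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 12},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 16, "EndOffset": 32},
]

print(redact_pii(sample_text, sample_entities))
# → Contact [NAME] at [EMAIL]
```

Replacing from the last span to the first is a deliberate choice here: substituting a span changes the length of the string, which would otherwise shift every offset that comes after it.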

Test 4: Sentiment Analysis

Sentiment Analysis, as the name suggests, gauges the mindset of the person behind the sentence.

It is categorised at a high level into 4 categories -

  1. POSITIVE
  2. NEGATIVE
  3. MIXED
  4. NEUTRAL
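
def detect_sentiment(text: str):
    # Ask Comprehend for the overall sentiment of an English text.
    response = client.detect_sentiment(Text=text, LanguageCode='en')

    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    print(f"sentiment: {response.get('Sentiment')}")
    print(f"sentiment score: {response.get('SentimentScore')}")
    return response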
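
The response holds both a `Sentiment` label and a `SentimentScore` dictionary with one score per category (`Positive`, `Negative`, `Neutral`, `Mixed`). As a rough sketch, the hypothetical helper `summarize_sentiment` below pairs the predicted label with its own score. The sample response is hand-written with invented scores.

```python
def summarize_sentiment(response: dict) -> tuple:
    """Return the predicted label together with the score for that label."""
    label = response["Sentiment"]        # e.g. "NEUTRAL"
    scores = response["SentimentScore"]  # keys: Positive, Negative, Neutral, Mixed
    # The score keys are capitalised versions of the label, e.g. "Neutral".
    return label, scores[label.capitalize()]


# Hand-written sample in the shape of a detect_sentiment response.
# The scores are invented for illustration and are NOT real Comprehend output.
sample_response = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {"Positive": 0.05, "Negative": 0.10, "Neutral": 0.84, "Mixed": 0.01},
}

print(summarize_sentiment(sample_response))
# → ('NEUTRAL', 0.84)
```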

Test 5: Detect Syntax

Every sentence is built according to the rules of English grammar. It involves parts of speech such as adjectives and prepositions, dependencies between words, subject-verb agreement, and a lot more.

Many times, while building complex NLP solutions, it becomes necessary to get a grasp of the Part of Speech (POS) tags linked to the words in a statement. For Natural Language Querying, they play a crucial role in understanding the dependencies and context of the words within the statement.

def detect_syntax(text: str):
    # Ask Comprehend for the part-of-speech tags of an English text.
    response = client.detect_syntax(Text=text, LanguageCode='en')

    # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
    for token in response.get("SyntaxTokens"):
        part_of_speech = token.get("PartOfSpeech")
        print(f"TokenId: {token.get('TokenId')}")
        print(f"Text: {token.get('Text')}")
        print(f"PartOfSpeech: {part_of_speech}")
        print(f"PartOfSpeechTag: {part_of_speech.get('Tag')}")
        print(f"PartOfSpeechScore: {part_of_speech.get('Score')}")

    return response
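
A common next step is to pick out only the words carrying a particular tag. The snippet below is a minimal sketch of that: `words_with_tag` is a hypothetical helper, and `sample_response` is a hand-written stand-in shaped like a `detect_syntax` response, with tags and scores chosen for illustration only.

```python
def words_with_tag(response: dict, tag: str) -> list:
    """Return the text of every token whose part-of-speech tag equals `tag`."""
    return [
        token["Text"]
        for token in response.get("SyntaxTokens", [])
        if token["PartOfSpeech"]["Tag"] == tag
    ]


# Hand-written sample in the shape of a detect_syntax response.
# Tags and scores are invented for illustration and are NOT real Comprehend output.
sample_response = {
    "SyntaxTokens": [
        {"TokenId": 1, "Text": "today", "PartOfSpeech": {"Tag": "NOUN", "Score": 0.98}},
        {"TokenId": 2, "Text": "in", "PartOfSpeech": {"Tag": "ADP", "Score": 0.99}},
        {"TokenId": 3, "Text": "Seattle", "PartOfSpeech": {"Tag": "PROPN", "Score": 0.99}},
    ]
}

print(words_with_tag(sample_response, "ADP"))
# → ['in']
```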

Every platform has its own schema for the kind of syntactic information it provides. For AWS -

| Token | Part of speech | Description |
| --- | --- | --- |
| ADJ | Adjective | Words that typically modify nouns. |
| ADP | Adposition | The head of a prepositional or postpositional phrase. |
| ADV | Adverb | Words that typically modify verbs. They may also modify adjectives and other adverbs. |
| AUX | Auxiliary | Function words that accompany the verb of a verb phrase. |
| CCONJ | Coordinating conjunction | Words that link words or phrases without subordinating one to the other. |
| DET | Determiner | Articles and other words that specify a particular noun phrase. |
| INTJ | Interjection | Words used as an exclamation or part of an exclamation. |
| NOUN | Noun | Words that specify a person, place, thing, animal, or idea. |
| NUM | Numeral | |

Conclusion

Congratulations! If you have followed the steps mentioned above, you have successfully covered several of the most common use cases that you are likely to face while working with Natural Language Processing.

All thanks to AWS Comprehend.

In case the curious mind within you is not satisfied and you wish to explore more, detailed documentation can be found below.

Comprehend - Amazon Comprehend
Describes how Amazon Comprehend works.

I hope this article finds you well. For advanced content on Natural Language Processing STAY TUNED 😁.

Vaibhav Satpathy

AI Enthusiast and Explorer
