NLP Chapter 3: spaCy

NLP May 11, 2021

In our previous chapter we took a look at the various basic implementations of NLP using Google Cloud Platform.

Today we will attempt the same endeavour, but with the open-source framework spaCy.

spaCy is a leading open-source framework, setting industrial standards in the field of Natural Language Processing. It benchmarks itself as one of the fastest and most accurate engines for tokenising, tagging and entity extraction. It supports 64+ languages and offers additional integrations for custom models from frameworks such as TensorFlow and PyTorch.

Some of the salient features of spaCy are captured in the overview image below.

Image reference - https://spacy.io/

Now that we are aware of the magic spaCy can bring to the world of NLP and to your solutions, let us dig into how to get started with it.

Step 1: Install the necessary Packages

pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm

For installation from source or compilation as a binary, follow this link for further details.

Step 2: Understand and Test Linguistic Annotations

Linguistic annotation, also known as syntactic information, provides us with the grammatical details of a given statement. It includes part-of-speech (POS) tags, tokens, a dependency graph, lemmatised words and more. This is regarded as one of the strongest features of NLP, as it helps us better understand the context, the sentence structure and the user's dialect.

import spacy

# Load the small English pipeline installed in Step 1
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each token's text, coarse POS tag and dependency label
for token in doc:
    print(token.text, token.pos_, token.dep_)

OUTPUT

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN advcl
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj

As you can see, there is a wealth of information and analysis that can be drawn from such a rich response. The output is not limited to these parameters; there is an extensive set of attributes and functionality on offer.

Let's take a look at the range of features and attributes offered by spaCy's syntactic information (see the snippet after this list) -

  • Text: The original word text.
  • Lemma: The base form of the word.
  • POS: The simple UPOS part-of-speech tag.
  • Tag: The detailed part-of-speech tag.
  • Dep: Syntactic dependency, i.e. the relation between tokens.
  • Shape: The word shape – capitalisation, punctuation, digits.
  • is_alpha: Is the token an alpha character?
  • is_stop: Is the token part of a stop list, i.e. the most common words of the language?
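
As a quick reference, here is a minimal sketch (reusing the same sentence as in the snippet above) that prints each of these attributes for every token:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # text, lemma, coarse POS, fine-grained tag, dependency, shape, alpha/stop flags
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)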

Step 3: Understand and Test Named Entity Recognition (NER)

Entities are at the heart of NLP. When a statement is sent to a model for NER prediction, it returns a subset of words, each tagged with a label from a pre-configured taxonomy of entities.

As all these models are statistical in nature, the accuracy should never be expected to be 100%. The beauty of these technologies is that they can be trained on custom entities to suit one's own requirements.

import spacy

# Load the small English pipeline and run the same example sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each entity's text, character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

OUTPUT

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY

The attributes offered by spaCy under the umbrella of NER are as follows (see the snippet after this list) -

  • Text: The original entity text.
  • Start: Character offset of the start of the entity in the Doc.
  • End: Character offset of the end of the entity in the Doc.
  • Label: Entity label, i.e. type.
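
Since Start and End are character offsets, a quick sanity check is to slice the original text with them and confirm you get the entity back:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    # ent.start_char / ent.end_char are character offsets into doc.text;
    # ent.start / ent.end are the corresponding token indices
    print(doc.text[ent.start_char:ent.end_char], ent.label_)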

Step 4: Congratulate yourself on your success

If you were able to run all the above-mentioned scripts and get the desired results, then you have officially performed, and understood, some of the core concepts in the field of NLP.

Now that we have successfully performed some basic programming with spaCy, let's talk about the other advanced features the framework offers.

Most real-world problems and use cases require custom solutions; out-of-the-box or pre-trained models often don't suffice, satisfy the needs or match the expected numbers. Hence it becomes crucial for an application to incorporate functionality such as custom pipelines, custom training, custom entities and more.

spaCy has it all: pipelines, training, entities, linguistics, POS tagging and much more.
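
As a small illustration of what a custom-entity setup can look like, spaCy ships with an EntityRuler component that layers rule-based entities on top of the statistical NER. The label and pattern below are made up purely for demonstration:

import spacy

nlp = spacy.load("en_core_web_sm")

# Insert a rule-based EntityRuler before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Hypothetical pattern: tag the made-up product name "FooPhone" as a PRODUCT
ruler.add_patterns([{"label": "PRODUCT", "pattern": "FooPhone"}])

doc = nlp("Apple might unveil the FooPhone next year")
print([(ent.text, ent.label_) for ent in doc.ents])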

Another noteworthy feature offered by spaCy is its built-in visualiser, displaCy. It is important for every functionality to be visually impactful as well; most developers and researchers in the exploration stage would rather not stare at a black-and-white screen. Visualisation tools make a significant difference in understanding and developing one's command of not only the framework but also the subject.
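
For instance, rendering the entities from the earlier example takes a single call. Inside a Jupyter notebook displacy.render shows the mark-up inline; from a plain script you can use displacy.serve instead:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Highlight named entities; use style="dep" for the dependency graph instead
displacy.render(doc, style="ent")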

Conclusion

Based on everything we have done so far in this micro-series, we should now be able to kick-start our adventure in the field of NLP. It's very important to understand the fundamental applications of NLP and the means by which they can be implemented. With the vast buffet of resources available, we have successfully explored the usage of NLP via Google Cloud Platform and the open-source framework spaCy.

I hope this article helps you create POCs and leverage NLP in your applications to provide better usability and flexibility in the user experience. We will be diving deeper into NLP and its visualisation tools in our upcoming series. STAY TUNED. 😁


Vaibhav Satpathy

AI Enthusiast and Explorer
