In the previous chapter, we discussed in detail what Natural Language Processing is and saw that a generic model does not suit all our requirements, which brought us to the notion of custom-trained NLP models. We also introduced the steps involved before training can start and spent some time on the nuances of data strategies.
This brings us to the next step in the process: data annotation.
What is Data Annotation?
Data annotation is the process of assigning metadata, such as labels or tags, to raw data. It is used to prepare training datasets for all kinds of supervised learning problems, such as object detection, entity recognition, image classification, and image segmentation.
Data annotation is an integral part of training a machine learning model: the neural network learns from the patterns that exist in the annotated data and then tries to identify the same patterns in new, unseen data.
Data annotation is usually a tedious task for any data scientist, but many tools on the market make the process easier and more intuitive. In this article we will explore one such tool: Label Studio.
What is Label Studio?
Label Studio is an open source data annotation tool. It lets you annotate data on your local machine or on a server through a simple and intuitive user interface, and it supports many types of labeling problems, such as image classification, NER tagging, and image segmentation.
Today we will explore different features of Label Studio with a simple example: annotating text for Named Entity Recognition. Without wasting time, let’s dive right in.
Installation and Startup
There are several ways to install Label Studio on your system. We will use the Python pip package. You need Python 3.6 or above installed. Run the commands below in your terminal to install the package.
python3 -m venv env
source env/bin/activate
python -m pip install label-studio
We suggest using a virtual environment to reduce the chance of version mismatches between installed libraries.
After installation, you can run the command below to start Label Studio on your local system.
label-studio start
The Label Studio instance will start on http://localhost:8080. Navigate to that URL to check that the instance is running.
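If you would rather check from a script than from the browser, the following is a minimal sketch using only the Python standard library. The helper name is_reachable is hypothetical, and the sketch only tests whether something answers HTTP requests at the URL; it does not confirm that the responder is Label Studio.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar problems.
        return False


if __name__ == "__main__":
    # URL taken from the article; adjust it if you started on another port.
    print(is_reachable("http://localhost:8080"))
```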
Data annotation is rarely a one-person job. It usually requires a collaborative effort from the whole team. Label Studio provides a way to manage users and distribute tasks among team members.
You can sign up multiple users from the UI, and each user can then log in with their own credentials. When the service is deployed on the cloud and used by different people, you can control user access in several ways. Please refer to the documentation for details.
Import Data from local system
After signing up, we need to create a project. Click on "Create Project" and give your project a name. Once the project is created, we can import our dataset. Navigate to the "Data Import" tab, where you can upload data either by clicking the "Upload Files" button or by dragging and dropping files onto the page. In our example, we uploaded a text file; various other file formats are also supported.
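If you prefer to prepare the data as JSON before uploading, the sketch below turns the lines of a text file into a list of tasks. It assumes the JSON task layout in which each task has a "data" object, and it assumes a field called "text", which has to match the variable used in your labeling setup. The function names and the sample sentences are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def text_to_tasks(raw_text: str) -> List[Dict]:
    """Turn each non-empty line of text into one annotation task."""
    tasks = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:
            # "text" is an assumed field name; keep it in sync with the labeling setup.
            tasks.append({"data": {"text": line}})
    return tasks


def write_tasks(tasks: List[Dict], path: str) -> None:
    """Save the tasks as a JSON file that can be uploaded from the import page."""
    Path(path).write_text(
        json.dumps(tasks, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Hypothetical sample content standing in for the uploaded text file.
sample = "Sundar Pichai leads Google.\n\nLabel Studio is open source.\n"
tasks = text_to_tasks(sample)
print(len(tasks))
```

Calling write_tasks(tasks, "tasks.json") would then produce a file you can drag onto the import page like any other upload.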
The next step will be to select the "labeling setup". There are multiple setups available to choose from.
For our example, we have taken a text file on which we will be performing Named Entity Recognition.
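Behind each labeling setup is a small XML configuration that you can also edit by hand. As a rough illustration, the sketch below builds a configuration in the style of the Named Entity Recognition template, using the View, Labels, Label, and Text tags. The label names (PER, ORG, LOC) and the helper build_ner_config are assumptions made for this example, not something prescribed by Label Studio.

```python
from typing import List
from xml.sax.saxutils import quoteattr


def build_ner_config(labels: List[str], text_field: str = "text") -> str:
    """Build a NER-style labeling configuration as an XML string."""
    lines = ["<View>", '  <Labels name="label" toName="text">']
    for label in labels:
        # quoteattr escapes the value and wraps it in quotes.
        lines.append(f"    <Label value={quoteattr(label)}/>")
    lines.append("  </Labels>")
    # "$text" refers to the "text" field of each imported task (assumed name).
    lines.append(f'  <Text name="text" value={quoteattr("$" + text_field)}/>')
    lines.append("</View>")
    return "\n".join(lines)


# Hypothetical label set for illustration.
config = build_ner_config(["PER", "ORG", "LOC"])
print(config)
```

The printed XML can be pasted into the code view of the labeling setup if you want to adjust the entity labels beyond what the template offers.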
Start your annotations
Now we can start annotating. For Named Entity Recognition, we need to assign entity tags to spans of the text. These chunks of text, together with their entity tags, are what the neural network later uses to learn a generalized pattern.
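To make the idea of tagged chunks concrete, here is a small sketch of one common way such annotations are prepared for a model: converting character-level spans into token-level BIO tags. The whitespace tokenization, the sample sentence, and the label names are assumptions for illustration; a real pipeline would use the tokenizer of the model being trained.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token with B-, I-, or O."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # Only tokens lying fully inside a span receive an entity tag.
            if tok_start >= start and tok_end <= end:
                prefix = "B-" if tok_start == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical sentence and annotations.
text = "Sundar Pichai leads Google"
spans = [(0, 13, "PER"), (20, 26, "ORG")]
print(spans_to_bio(text, spans))
```

Tokens that only partly overlap a span, for example a word with trailing punctuation attached, are tagged O in this simplified version.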
Export the annotations
Once annotation is complete, the annotated data can be exported in the desired format from the UI. Label Studio supports multiple export formats. Click on the "Export" button to download the result.
We have seen how Label Studio simplifies the tedious and time-consuming task of data annotation. This solution can be scaled by deploying it on a server and collaborating with the whole team to complete the task more efficiently and accurately.
This is one of many tools currently available for data annotation. Some are open source while others are cloud-native. In upcoming chapters, we will discuss some cloud-native tools, such as Google Annotator.
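As a sketch of what you might do with the downloaded file, the code below reads a JSON export and collects (text, entities) pairs that a training script could consume. The field names ("data", "annotations", "result", "value", "labels") reflect the JSON export layout as we understand it, so check them against your own export. The sample record and function names are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def load_export(path: str) -> List[Dict[str, Any]]:
    """Read an exported JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def export_to_examples(tasks: List[Dict[str, Any]]) -> List[Tuple[str, List[Entity]]]:
    """Collect (text, sorted entity spans) for every exported task."""
    examples = []
    for task in tasks:
        text = task["data"]["text"]  # "text" is the assumed task field name
        entities = []
        # Note: spans from all annotators of a task are merged here.
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                if item.get("type") != "labels":
                    continue
                value = item["value"]
                for label in value["labels"]:
                    entities.append((value["start"], value["end"], label))
        examples.append((text, sorted(entities)))
    return examples


# Hypothetical export record for illustration.
sample_export = [{
    "id": 1,
    "data": {"text": "Sundar Pichai leads Google"},
    "annotations": [{"result": [
        {"from_name": "label", "to_name": "text", "type": "labels",
         "value": {"start": 20, "end": 26, "text": "Google", "labels": ["ORG"]}},
        {"from_name": "label", "to_name": "text", "type": "labels",
         "value": {"start": 0, "end": 13, "text": "Sundar Pichai", "labels": ["PER"]}},
    ]}],
}]
print(export_to_examples(sample_export))
```

With a real file, export_to_examples(load_export("export.json")) would give the same structure, where "export.json" stands for whatever name your download has.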
STAY TUNED for more content on Custom NLP in our future posts. 😁