In the previous article, we discussed the approach needed to implement an AI solution. That was a 1,000-foot view of the AI process. In this article, we will explore all the steps needed to make your data suitable to be fed into the model of your choice.
“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualise it, to communicate it, is going to be a hugely important skill in the next decades.”
— Hal Varian, Chief Economist, Google.
Why is Data so important?
To build an AI capability into a system, the most important ingredient is data. Data provides the information needed for better understanding and accurate decision-making. That information can be converted into useful insights, which in turn help us improve our performance or make the right decisions.
But is it that simple?
It sounds simple, right? We get the data from the source environment, use it to gain insights, feed it to machine learning architectures and, voilà, we have generated artificial intelligence. Can it really be that simple?
Of course not! There are multiple hidden steps in the process that are essential for attaining the desired results. Let us dive into these steps in detail.
“Who has the data, has the power.” — Tim O’Reilly
There are multiple ways to retrieve the relevant data. It can be internal to your organisation or come from an external source. Most of the time, the data is readily available. Sometimes, though, the process is long and tedious, and you have to go into the field and spend a significant amount of time collecting the data yourself.
One such example is PlantVillage, a project that used TensorFlow to help cassava farmers in Tanzania. The team collected over 50,000 high-quality images of the plants and annotated them to train a single-shot object detection model that could predict which disease a plant had.
The first step is to identify useful data. While this may sound simple, trust me, it's not. Locating an appropriate source that can provide the data we need can be complex and frustrating. The data source must be validated to determine whether the data set is appropriate for use; this often feels like detective work or investigative reporting. The following points need to be considered when selecting the data:
- Structure of the data (structured, unstructured, semi-structured, table-based, proprietary)
- Source of the data (internal, external, private, public)
- Value of the data (generic, unique, specialised)
- Quality of the data (verified, static, streaming)
- Storage of the data (remotely accessed, shared, dedicated platforms, portable)
- Relationship of the data (superset, subset, correlated)
Once we have identified the source, we need to think about storing the data. Data storage covers not only the raw data that was retrieved but also any knowledge base created from it.
Defining the ownership of the data is also an important step. Special attention must be given to authentication and authorisation when we talk about access to data. If the data is internal to your organisation, take extra care to prevent data leaks. Whoever holds the data holds the unexplored power contained in it.
Once you have stored the data, the next step will be to format it. Data is often like a diamond in the rough: it needs polishing to be of any use to you.
“Torture the data, and it will confess to anything.” — Ronald Coase
Your task now is to sanitise and prepare it for use in the modeling and reporting phase. Doing so is tremendously important because your models will perform better and you’ll lose less time trying to fix strange output. Your model needs the data in a specific format, so data transformation will always come into play. It’s a good habit to correct data errors as early on in the process as possible.
This step helps you identify and fix the anomalies in the data. These anomalies may have been introduced by human error at the time of data collection, a faulty device generating false readings, white-space characters in text data, null values, and so on. They need to be identified as early as possible to prevent them from affecting the final results.
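The fixes above can be sketched with pandas. The tiny table below is made up, and the -999 sentinel is a hypothetical convention for a faulty-sensor reading; the cleaning calls are the point:

```python
import pandas as pd
import numpy as np

# Hypothetical raw records with typical anomalies: stray whitespace,
# null values, and a faulty-device sentinel reading (-999).
df = pd.DataFrame({
    "city": ["  Pune", "Mumbai ", "Delhi", None],
    "temp_c": [31.5, np.nan, 28.0, -999.0],
})

df["city"] = df["city"].str.strip()                      # remove white-space characters
df["temp_c"] = df["temp_c"].replace(-999.0, np.nan)      # treat false readings as missing
df = df.dropna(subset=["city"])                          # drop rows with no usable key
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())  # impute remaining nulls
```

Imputing with the mean is only one of several reasonable choices; dropping the row or interpolating may suit your data better.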
As it is practically impossible to keep everything in a single data source or a single table, your data may come from different sources. This data needs to be collected and visualised together to be understood properly. Combining data from different sources is a complex but important step. Data varies in size, type, and structure, ranging from databases and Excel files to text documents. For simplicity's sake, let's talk about tabular data. There are various combination steps you can perform, such as joining tables, appending tables, and creating intersection tables.
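Joining and appending, the two combination steps mentioned above, look like this in pandas. The customer and order tables are invented for illustration:

```python
import pandas as pd

# Two hypothetical sources: customer records and their orders.
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [250, 100, 400]})

# Joining tables: attach the customer name to each order row.
joined = orders.merge(customers, on="cust_id", how="left")

# Appending tables: stack a second batch of orders below the first.
more_orders = pd.DataFrame({"cust_id": [2], "amount": [75]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
```

An intersection table would be the same `merge` with `how="inner"`, keeping only keys present in both tables.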
If data is being fetched from different sources, there is a good chance it is stored in different ways. For example, the distance between two cities may be stored in kilometres in one system and in miles in another. Transformation plays an important role here: we need to transform the data to make it uniform. This sub-step is also of immense use when there is a need for dimensionality reduction. Often there are features in your dataset that add no value or are not relevant, and we need to remove or transform such data points to make the data more meaningful.
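The kilometres-versus-miles example can be made concrete. The two one-row systems below are hypothetical; the pattern is to harmonise everything to one unit and then drop the columns that no longer add value:

```python
import pandas as pd

# Hypothetical distance records from two systems with different units.
system_a = pd.DataFrame({"route": ["A-B"], "distance": [120.0], "unit": ["km"]})
system_b = pd.DataFrame({"route": ["C-D"], "distance": [62.14], "unit": ["mi"]})

combined = pd.concat([system_a, system_b], ignore_index=True)

MI_TO_KM = 1.60934
# Harmonise every row to kilometres.
combined["distance_km"] = combined.apply(
    lambda r: r["distance"] * MI_TO_KM if r["unit"] == "mi" else r["distance"],
    axis=1,
)

# The mixed-unit columns are now redundant; remove them.
combined = combined.drop(columns=["distance", "unit"])
```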
Till now we have worked on validating our dataset and getting it ready to be processed. Now let's dive right into our data and try to understand what story it is trying to communicate.
“It is a capital mistake to theorise before one has data.” — Sherlock Holmes
This step is about exploring the data, so keeping your mind and eyes open is essential during the exploratory data analysis phase. The goal isn't to cleanse the data, but you will commonly still discover anomalies you missed before, and you may need to take a step back and fix them.
The single most widely used technique for data exploration is visualisation. Information shown graphically makes the greatest impact and is easier for readers to understand.
If you are given customer data and asked to identify which age group is dominant, in which format would you prefer the data?
Option 1: An excel sheet with 1000 rows of data?
Option 2: A pie chart showing the distribution of customer age group?
Information becomes much easier to grasp when shown in a picture, so you mainly use graphical techniques to gain an understanding of your data and the interactions between variables. There is no fixed set of steps to follow when representing data graphically; the process is intuitive and requires a lot of dedication and time.
There are different ways to represent your data: a simple histogram showing the distribution of data points, or a line chart showing the trend the data follows. Multiple variables can be visualised with more complex graphs such as scatter plots or heat maps.
This is a fairly iterative process, and to understand which format suits which data, you need to get a feel for your data.
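The age-group question above boils down to the aggregation a pie chart would display: bin the ages and count. A minimal sketch with pandas, using made-up ages and bin edges:

```python
import pandas as pd

# Hypothetical customer ages; in practice this column comes from your dataset.
ages = pd.Series([22, 25, 31, 34, 37, 41, 45, 52, 58, 63])

# Bin the ages into groups and count the members of each bin.
groups = pd.cut(ages, bins=[18, 30, 45, 60, 75],
                labels=["18-30", "31-45", "46-60", "61-75"])
distribution = groups.value_counts().sort_index()
dominant = distribution.idxmax()
```

A single call such as `distribution.plot.pie()` (with matplotlib installed) then produces the Option 2 view from the same counts.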
Though the majority of this task lies in visual representation, there are non-graphical techniques as well that help us understand the data better. Clustering is one such example: for a market segmentation problem, the first approach that may come to mind is clustering. We can perform simple k-means clustering to test our hypotheses and understand the distribution of the data.
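To show what k-means segmentation does without relying on any ML library, here is a bare-bones NumPy sketch on invented spend data for two customer segments (in practice you would reach for a library implementation such as scikit-learn's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spend figures for two customer segments: low and high spenders.
low = rng.normal(loc=20.0, scale=2.0, size=(10, 1))
high = rng.normal(loc=80.0, scale=2.0, size=(10, 1))
X = np.vstack([low, high])

def kmeans(X, k, iters=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    init = np.random.default_rng(1)
    centroids = X[init.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X, k=2)
```

With well-separated groups like these, the two recovered centroids land near the true segment means, confirming (or refuting) the segmentation hypothesis before any modelling begins.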
Data preparation takes up the majority of a data scientist's time, and fairly so: it is that important a step. The whole motive of this exercise is to identify and resolve all the issues with the data up front so that they do not impact our results. There is no “one size fits all” in the world of data science, so it is also important not to be a slave to the process. Every problem statement needs to be handled in its own unique way.
I hope this article explains the key steps needed when working with data. STAY TUNED for a follow-up article where we will understand how to make use of data and explore the different Machine Learning techniques. 😊