Let's understand the situation first and then define one-hot encoding. Sometimes solving a problem is one of the fastest ways to understand the concepts. Alright, let's create a situation first (I just made up the situation).
Let's say you are solving a simple data science problem. Now, it doesn't matter what the actual problem is about but you are caught up in a situation where you have a tiny data set which has 7 instances and each of this instance has 4 features. In lame words, the data set has 7 rows and 4 columns. Out of which the three columns are of type object
meaning those columns comprise string values. The other column is of type int
meaning it has only integer values. Now enough talk let's practically see how the data set looks like. Rather than showing you the raw data (.CSV format). I formatted it into a data frame using the pandas library.
To be on the safer side, let's see the data types of the columns.
Now the actual situation starts since some learning algorithms work only with numeric data you have to somehow deal with this object data. There are two options to deal with this situation:
I know option 1 works well, but sometimes you have to focus and work hard for a living. Now the solution to this situation is to convert this object
type of data into several binary
ones. What I mean by this is look at the data set very closely. The column Favourite Color has 6 unique values such as Red, Orange, Yellow, Green, Purple, and Blue. Now we can transform this feature into a vector of six numerical values as shown below:
Similarly don't you think we can transform the Favourite Day column into a vector of six numerical values too? Because there are 7 unique days in this column such as Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
Now you might be thinking what about the Attitude column can we not do the same. But the catch point here is no. Don't do the same. Here we shall learn a new concept called order. Since there are decent, good, better, best and excellent. We shall order them as {decent, good, better, best, excellent} as {1, 2, 3, 4, 5} or {0, 1, 2, 3, 4}.
This is because when the ordering of some values matters we can replace those values by keeping only one variable.
Remember this technique does not work in all the cases. For example, some of you might think that can't we use the same technique to fill the values for the other two columns too. By doing so you will certainly decrease the dimensions of the feature vector but it implies that there is an order among the values in that category and it will often confuse the learning algorithm. The learning algorithm will try to find a state or regularity when there is no one and the algorithm will most likely overfit. So think and use this technique wisely. Use this only when the order of the values is important. This technique can be used in the cases of quality of an article, user reviews of a product, taste of food, etc.
So knowingly or unknowingly you have learned and mastered the concept of One-hot encoding, and where to use it. This is how you convert the categorical or object type data into numeric type data. Let us see how we can actually code this and come out of this situation.
As mentioned earlier, this is a made-up dataset. Created for this tutorial. Nothing personal.
import pandas as pd
# Creating a list with some values
studentID = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
color = ['Red', 'Orange', "Yellow", 'Green', 'Yellow', 'Purple', 'Blue']
DaysOfTheWeek = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
Attitude = ['Best', 'Decent', 'Better', 'Excellent', 'Excellent', 'Good', 'Best']
Now that we have the list let's convert this into a data frame. To do this we need to zip all the list values and then store it.
# Converting the list into a data frame and simultaneously renaming the columns.
df = pd.DataFrame(list(zip(studentID, color, DaysOfTheWeek, Attitude)), columns =['Student ID', 'Favourite Color', 'Favourite Day', 'Attitude'])
df
Student ID | Favourite Color | Favourite Day | Attitude | |
---|---|---|---|---|
0 | 1000 | Red | Monday | Best |
1 | 1001 | Orange | Tuesday | Decent |
2 | 1002 | Yellow | Wednesday | Better |
3 | 1003 | Green | Thursday | Excellent |
4 | 1004 | Yellow | Friday | Excellent |
5 | 1005 | Purple | Saturday | Good |
6 | 1006 | Blue | Sunday | Best |
This is because in most cases you might get a categorical type of data. But in this, all the three as seen above is of an object
type. If this is the case with you then you need to manually convert them to categorical type.
# Converting the object type data into categorical data column
for col in ['Favourite Color','Favourite Day', 'Attitude']:
df[col] = df[col].astype('category')
df.dtypes
Student ID int64 Favourite Color category Favourite Day category Attitude category dtype: object
As discussed we will be transforming only the Favourite Color and Favourite Day columns to its binary value columns. Rather than manually doing this we can use the pandas get_dummies
method.
# Assigning the binary values for Favourite Day and Favourite Color columns
df = pd.get_dummies(data=df,columns=['Favourite Color','Favourite Day'])
df
Student ID | Attitude | Favourite Color_Blue | Favourite Color_Green | Favourite Color_Orange | Favourite Color_Purple | Favourite Color_Red | Favourite Color_Yellow | Favourite Day_Friday | Favourite Day_Monday | Favourite Day_Saturday | Favourite Day_Sunday | Favourite Day_Thursday | Favourite Day_Tuesday | Favourite Day_Wednesday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000 | Best | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1001 | Decent | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 1002 | Better | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1003 | Excellent | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1004 | Excellent | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 1005 | Good | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
6 | 1006 | Best | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
By doing so you will obviously increase the dimension of your data set, but your learning algorithm will perform a lot more better.
There are two ways you can do this:
LabelEncoder
methodOption 1 is just of no use because what if you have more than 1000 unique values then you might use a looping statement and make your life complicated. It's 2020 think smart and use the sklearn
library to do this.
# Assigning order to the categorical column
from sklearn.preprocessing import LabelEncoder
# Initializing an object of class LabelEncoder
labelencoder = LabelEncoder()
df['Attitude'] = labelencoder.fit_transform(df['Attitude'])
df
Student ID | Attitude | Favourite Color_Blue | Favourite Color_Green | Favourite Color_Orange | Favourite Color_Purple | Favourite Color_Red | Favourite Color_Yellow | Favourite Day_Friday | Favourite Day_Monday | Favourite Day_Saturday | Favourite Day_Sunday | Favourite Day_Thursday | Favourite Day_Tuesday | Favourite Day_Wednesday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1001 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 1002 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1003 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1004 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 1005 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
6 | 1006 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
There you go, now you can use your favorite learning algorithm and then tell fit(X, y) or whatever and sleep happily.
Alright, guys, I hope you have learned something new today. This is really a very important concept and feature engineering technique that you will come across. This is one of the most commonly asked questions during data science interviews. If you have any doubts regarding this tutorial then the comment section is all yours. Until then stay safe, goodbye. I will see you next time.