Learning One-Hot Encoding in Python the Easy Way¶

In this tutorial, we will learn one of the important concepts in feature engineering know as one-hot encoding from scratch.¶

alt text

Let's understand the situation first and then define one-hot encoding. Sometimes solving a problem is one of the fastest ways to understand the concepts. Alright, let's create a situation first (I just made up the situation).

Situation¶

Let's say you are solving a simple data science problem. Now, it doesn't matter what the actual problem is about but you are caught up in a situation where you have a tiny data set which has 7 instances and each of this instance has 4 features. In lame words, the data set has 7 rows and 4 columns. Out of which the three columns are of type object meaning those columns comprise string values. The other column is of type int meaning it has only integer values. Now enough talk let's practically see how the data set looks like. Rather than showing you the raw data (.CSV format). I formatted it into a data frame using the pandas library.

alt text

To be on the safer side, let's see the data types of the columns.

alt text

Now the actual situation starts since some learning algorithms work only with numeric data you have to somehow deal with this object data. There are two options to deal with this situation:

Delete all the three columns and then go to sleep
Read this tutorial and implement one-hot encoding

I know option 1 works well, but sometimes you have to focus and work hard for a living. Now the solution to this situation is to convert this object type of data into several binary ones. What I mean by this is look at the data set very closely. The column Favourite Color has 6 unique values such as Red, Orange, Yellow, Green, Purple, and Blue. Now we can transform this feature into a vector of six numerical values as shown below:

alt text

Similarly don't you think we can transform the Favourite Day column into a vector of six numerical values too? Because there are 7 unique days in this column such as Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.

alt text

Now you might be thinking what about the Attitude column can we not do the same. But the catch point here is no. Don't do the same. Here we shall learn a new concept called order. Since there are decent, good, better, best and excellent. We shall order them as {decent, good, better, best, excellent} as {1, 2, 3, 4, 5} or {0, 1, 2, 3, 4}.

This is because when the ordering of some values matters we can replace those values by keeping only one variable.

Remember this technique does not work in all the cases. For example, some of you might think that can't we use the same technique to fill the values for the other two columns too. By doing so you will certainly decrease the dimensions of the feature vector but it implies that there is an order among the values in that category and it will often confuse the learning algorithm. The learning algorithm will try to find a state or regularity when there is no one and the algorithm will most likely overfit. So think and use this technique wisely. Use this only when the order of the values is important. This technique can be used in the cases of quality of an article, user reviews of a product, taste of food, etc.

So knowingly or unknowingly you have learned and mastered the concept of One-hot encoding, and where to use it. This is how you convert the categorical or object type data into numeric type data. Let us see how we can actually code this and come out of this situation.

Creating the dataset from scratch¶

As mentioned earlier, this is a made-up dataset. Created for this tutorial. Nothing personal.

In [0]:

import pandas as pd

# Creating a list with some values 

studentID = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
color = ['Red', 'Orange', "Yellow", 'Green', 'Yellow', 'Purple', 'Blue']
DaysOfTheWeek = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
Attitude = ['Best', 'Decent', 'Better', 'Excellent', 'Excellent', 'Good', 'Best']

Now that we have the list let's convert this into a data frame. To do this we need to zip all the list values and then store it.

In [21]:

# Converting the list into a data frame and simultaneously renaming the columns.

df = pd.DataFrame(list(zip(studentID, color, DaysOfTheWeek, Attitude)), columns =['Student ID', 'Favourite Color', 'Favourite Day', 'Attitude'])
df

Out[21]:

	Student ID	Favourite Color	Favourite Day	Attitude
0	1000	Red	Monday	Best
1	1001	Orange	Tuesday	Decent
2	1002	Yellow	Wednesday	Better
3	1003	Green	Thursday	Excellent
4	1004	Yellow	Friday	Excellent
5	1005	Purple	Saturday	Good
6	1006	Blue	Sunday	Best

Converting the object type data into the categorical type¶

This is because in most cases you might get a categorical type of data. But in this, all the three as seen above is of an object type. If this is the case with you then you need to manually convert them to categorical type.

In [22]:

# Converting the object type data into categorical data column

for col in ['Favourite Color','Favourite Day', 'Attitude']:
    df[col] = df[col].astype('category')
df.dtypes

Student ID            int64
Favourite Color    category
Favourite Day      category
Attitude           category
dtype: object

Assigning the binary codes to the categorical values¶

As discussed we will be transforming only the Favourite Color and Favourite Day columns to its binary value columns. Rather than manually doing this we can use the pandas get_dummies method.

In [23]:

# Assigning the binary values for Favourite Day and Favourite Color columns

df = pd.get_dummies(data=df,columns=['Favourite Color','Favourite Day'])
df

Out[23]:

	Student ID	Attitude	Favourite Color_Blue	Favourite Color_Green	Favourite Color_Orange	Favourite Color_Purple	Favourite Color_Red	Favourite Color_Yellow	Favourite Day_Friday	Favourite Day_Monday	Favourite Day_Saturday	Favourite Day_Sunday	Favourite Day_Thursday	Favourite Day_Tuesday	Favourite Day_Wednesday
0	1000	Best	0	0	0	0	1	0	0	1	0	0	0	0	0
1	1001	Decent	0	0	1	0	0	0	0	0	0	0	0	1	0
2	1002	Better	0	0	0	0	0	1	0	0	0	0	0	0	1
3	1003	Excellent	0	1	0	0	0	0	0	0	0	0	1	0	0
4	1004	Excellent	0	0	0	0	0	1	1	0	0	0	0	0	0
5	1005	Good	0	0	0	1	0	0	0	0	1	0	0	0	0
6	1006	Best	1	0	0	0	0	0	0	0	0	1	0	0	0

By doing so you will obviously increase the dimension of your data set, but your learning algorithm will perform a lot more better.

Assigning orders to the categorical column called "Attitude"¶

There are two ways you can do this:

Manually assigning values using a dictionary.
Using LabelEncoder method

Option 1 is just of no use because what if you have more than 1000 unique values then you might use a looping statement and make your life complicated. It's 2020 think smart and use the sklearn library to do this.

In [24]:

# Assigning order to the categorical column 

from sklearn.preprocessing import LabelEncoder

# Initializing an object of class LabelEncoder

labelencoder = LabelEncoder() 
df['Attitude'] = labelencoder.fit_transform(df['Attitude'])
df

Out[24]:

	Student ID	Attitude	Favourite Color_Blue	Favourite Color_Green	Favourite Color_Orange	Favourite Color_Purple	Favourite Color_Red	Favourite Color_Yellow	Favourite Day_Friday	Favourite Day_Monday	Favourite Day_Saturday	Favourite Day_Sunday	Favourite Day_Thursday	Favourite Day_Tuesday	Favourite Day_Wednesday
0	1000	0	0	0	0	0	1	0	0	1	0	0	0	0	0
1	1001	2	0	0	1	0	0	0	0	0	0	0	0	1	0
2	1002	1	0	0	0	0	0	1	0	0	0	0	0	0	1
3	1003	3	0	1	0	0	0	0	0	0	0	0	1	0	0
4	1004	3	0	0	0	0	0	1	1	0	0	0	0	0	0
5	1005	4	0	0	0	1	0	0	0	0	1	0	0	0	0
6	1006	0	1	0	0	0	0	0	0	0	0	1	0	0	0

There you go, now you can use your favorite learning algorithm and then tell fit(X, y) or whatever and sleep happily.

Alright, guys, I hope you have learned something new today. This is really a very important concept and feature engineering technique that you will come across. This is one of the most commonly asked questions during data science interviews. If you have any doubts regarding this tutorial then the comment section is all yours. Until then stay safe, goodbye. I will see you next time.