Handling Categorical Data in Python¶

What do Nominal, Ordinal and Continuous features mean?¶

Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset is about information related to users, then you will typically find features like country, gender, age group, etc. Alternatively, if the data you're working with is related to products, you will find features like product type, manufacturer, seller and so on.

These are all categorical features in your dataset. These features are typically stored as text values which represent various traits of the observations. For example, gender is described as Male (M) or Female (F), product type could be described as electronics, apparels, food etc.

Features like these, where the categories are only labeled and have no order of precedence, are called nominal features. Features whose categories have an inherent order are called ordinal features.

For example, a feature like economic status has three categories (low, medium and high) with a natural order among them.
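As a minimal sketch of how the ordinal case could be represented in pandas, the snippet below uses a made-up `economic_status` column (illustrative data, not part of the flights dataset). It assumes an ordered `CategoricalDtype` is an acceptable representation, so that integer codes and comparisons follow the low < medium < high order.

```python
import pandas as pd

# Hypothetical data for illustration only.
status = pd.Series(['low', 'high', 'medium', 'low'], name='economic_status')

# Declare the categories in their natural order.
status_dtype = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
status_ordinal = status.astype(status_dtype)

# Integer codes follow the declared order: low=0, medium=1, high=2.
status_codes = status_ordinal.cat.codes.tolist()
print(status_codes)

# Ordered categoricals support comparisons against one of their categories.
at_least_medium = (status_ordinal >= 'medium').tolist()
print(at_least_medium)
```

A nominal feature would use the same dtype without `ordered=True`, in which case such comparisons are not defined.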

There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or a date/time.

Regardless of what the value is used for, the challenge is determining how to use this data in the analysis because of the following constraints:

Categorical features may have a very large number of levels, known as high cardinality (for example, cities or URLs), where most of the levels appear in a relatively small number of instances.
Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them.
While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.
For the machine, categorical data doesn’t contain the same context or information that humans can easily associate and understand. For example, when looking at a feature called City with the three values New York, New Jersey and New Delhi, humans can infer that New York is closely related to New Jersey because they are in the same country, while New York and New Delhi are very different. But for the model, New York, New Jersey and New Delhi are just three different levels (possible values) of the same feature City. If you don’t specify the additional contextual information, it will be impossible for the model to differentiate between highly different levels.
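The cardinality constraint above can be inspected before choosing an encoding. The following is a rough sketch on a hypothetical toy `City` column (the values are invented for illustration); it counts the distinct levels and lists those that appear only once.

```python
import pandas as pd

# Hypothetical toy column; real high-cardinality features have far more levels.
cities = pd.Series(['New York', 'New York', 'New York', 'New Jersey',
                    'New Delhi', 'Paris', 'Lima'], name='City')

n_levels = cities.nunique()            # number of distinct levels
level_counts = cities.value_counts()   # number of rows per level

# Levels that occur in a single row only.
rare_levels = sorted(level_counts[level_counts == 1].index)

print(n_levels)
print(rare_levels)
```

On a real dataset the threshold for "rare" is a judgment call; the value 1 here is only for the toy example.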

You are therefore faced with the challenge of turning these text values into numerical values for further processing, and of uncovering the interesting information these features might hide. Typically, a standard feature engineering workflow first transforms the categorical values into numeric labels and then applies some encoding scheme to those labels.
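The two-step workflow described here might look like the sketch below. It uses an invented `product_type` column, and the choice of `pd.factorize` for the labels and an identity-matrix lookup for the encoding is an assumption made for illustration, not the only way to do either step.

```python
import numpy as np
import pandas as pd

product_type = pd.Series(['electronics', 'food', 'apparels', 'food'])  # toy data

# Step 1: map each text value to an integer label (in order of first appearance).
labels, uniques = pd.factorize(product_type)

# Step 2: apply an encoding scheme to the labels; here, one indicator column per level.
indicators = np.eye(len(uniques), dtype=int)[labels]

print(labels.tolist())
print(list(uniques))
print(indicators.tolist())
```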

Encoding Categorical Data¶

Existing Encoding Methods (all modules for which code is available, see: http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html)

category_encoders.backward_difference
category_encoders.basen
category_encoders.binary
category_encoders.hashing
category_encoders.helmert
category_encoders.leave_one_out
category_encoders.one_hot
category_encoders.ordinal
category_encoders.polynomial
category_encoders.sum_coding
category_encoders.target_encoder

The techniques that you'll cover are the following:

Replacing values
Encoding labels
One-Hot encoding
Binary encoding
Backward difference encoding
Polynomial encodings
Miscellaneous features

1. Replace Values¶

In [1]:

import pandas as pd
import numpy as np
import copy
%matplotlib inline

In [2]:

df_flights = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')

df_flights.head()

Out[2]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	minute
0	2014	1	1	1.0	96.0	235.0	70.0	AS	N508AS	145	PDX	ANC	194.0	1542	1.0
1	2014	1	1	4.0	-6.0	738.0	-23.0	US	N195UW	1830	SEA	CLT	252.0	2279	4.0
2	2014	1	1	8.0	13.0	548.0	-4.0	UA	N37422	1609	PDX	IAH	201.0	1825	8.0
3	2014	1	1	28.0	-2.0	800.0	-23.0	US	N547UW	466	PDX	CLT	251.0	2282	28.0
4	2014	1	1	34.0	44.0	325.0	43.0	AS	N762AS	121	SEA	ANC	201.0	1448	34.0

In [3]:

cat_df_flights = df_flights.select_dtypes(include=['object']).copy()
cat_df_flights.head()

Out[3]:

	carrier	tailnum	origin	dest
0	AS	N508AS	PDX	ANC
1	US	N195UW	SEA	CLT
2	UA	N37422	PDX	IAH
3	US	N547UW	PDX	CLT
4	AS	N762AS	SEA	ANC

In [4]:

print(cat_df_flights.isnull().values.sum())

In [5]:

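# Fill every missing value with the most frequent tailnum;
# this assumes tailnum is the only column that contains NaNs
cat_df_flights = cat_df_flights.fillna(cat_df_flights['tailnum'].value_counts().index[0])
print(cat_df_flights.isnull().values.sum())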

In [6]:

replace_map = {'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}

In [7]:

labels = cat_df_flights['carrier'].astype('category').cat.categories.tolist()
# Map each carrier label (sorted alphabetically) to an integer starting at 1
replace_map_comp = {'carrier': dict(zip(labels, range(1, len(labels) + 1)))}

print(replace_map_comp)

{'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}

In [8]:

cat_df_flights_replace = cat_df_flights.copy()

In [9]:

cat_df_flights_replace.replace(replace_map_comp, inplace=True)

print(cat_df_flights_replace.head())

   carrier tailnum origin dest
0        2  N508AS    PDX  ANC
1        9  N195UW    SEA  CLT
2        8  N37422    PDX  IAH
3        9  N547UW    PDX  CLT
4        2  N762AS    SEA  ANC

In [10]:

print(cat_df_flights_replace['carrier'].dtypes)

int64

In [11]:

cat_df_flights_lc = cat_df_flights.copy()
cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].astype('category')
cat_df_flights_lc['origin'] = cat_df_flights_lc['origin'].astype('category')

print(cat_df_flights_lc.dtypes)

carrier    category
tailnum      object
origin     category
dest         object
dtype: object

In [12]:

%timeit cat_df_flights.groupby(['origin','carrier']).count() #DataFrame with object dtype columns

31.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]:

%timeit cat_df_flights_lc.groupby(['origin','carrier']).count() #DataFrame with category dtype columns

21.6 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Label Encoding¶

In [14]:

cat_df_flights_lc.head()

Out[14]:

	carrier	tailnum	origin	dest
0	AS	N508AS	PDX	ANC
1	US	N195UW	SEA	CLT
2	UA	N37422	PDX	IAH
3	US	N547UW	PDX	CLT
4	AS	N762AS	SEA	ANC

In [15]:

cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes

In [16]:

cat_df_flights_lc.head() #alphabetically labeled from 0 to 10

Out[16]:

	carrier	tailnum	origin	dest
0	1	N508AS	PDX	ANC
1	8	N195UW	SEA	CLT
2	7	N37422	PDX	IAH
3	8	N547UW	PDX	CLT
4	1	N762AS	SEA	ANC

In [17]:

cat_df_flights_specific = cat_df_flights.copy()
cat_df_flights_specific['US_code'] = np.where(cat_df_flights_specific['carrier'].str.contains('US'), 1, 0)

cat_df_flights_specific.head()

Out[17]:

	carrier	tailnum	origin	dest	US_code
0	AS	N508AS	PDX	ANC	0
1	US	N195UW	SEA	CLT	1
2	UA	N37422	PDX	IAH	0
3	US	N547UW	PDX	CLT	1
4	AS	N762AS	SEA	ANC	0

In [18]:

cat_df_flights_sklearn = cat_df_flights.copy()

from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
cat_df_flights_sklearn['carrier_code'] = lb_make.fit_transform(cat_df_flights['carrier'])

cat_df_flights_sklearn.head() #Results in appending a new column to df

Out[18]:

	carrier	tailnum	origin	dest	carrier_code
0	AS	N508AS	PDX	ANC	1
1	US	N195UW	SEA	CLT	8
2	UA	N37422	PDX	IAH	7
3	US	N547UW	PDX	CLT	8
4	AS	N762AS	SEA	ANC	1

Label encoding is intuitive and straightforward, and it may work well with your learning algorithm, but it has the disadvantage that the algorithm can misinterpret the numerical values as carrying order or magnitude. Should the carrier US (encoded as 8) be given eight times the weight of the carrier AS (encoded as 1)?

To avoid this issue, another popular way to encode the categories is one-hot encoding.
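The concern can be made concrete with a small sketch on three toy carrier values (not the full flights data). With label codes, the gap between two carriers depends on their alphabetical position, whereas one-hot vectors place every pair of distinct carriers at the same distance.

```python
import numpy as np
import pandas as pd

carriers = pd.Series(['AS', 'UA', 'US'])  # toy sample for illustration

# Label encoding: codes follow alphabetical order (AS=0, UA=1, US=2).
codes = carriers.astype('category').cat.codes.to_numpy()
label_gap_as_ua = int(abs(codes[0] - codes[1]))
label_gap_as_us = int(abs(codes[0] - codes[2]))

# One-hot encoding: one indicator column per carrier.
onehot = pd.get_dummies(carriers).to_numpy().astype(float)
onehot_gap_as_ua = np.linalg.norm(onehot[0] - onehot[1])
onehot_gap_as_us = np.linalg.norm(onehot[0] - onehot[2])

print(label_gap_as_ua, label_gap_as_us)    # the two gaps differ
print(onehot_gap_as_ua, onehot_gap_as_us)  # the two gaps are equal
```

Distance-based and linear models are the ones most likely to be affected by the unequal gaps; tree-based models are generally less sensitive to them.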

3. One-Hot encoding¶

In [19]:

cat_df_flights.head()

Out[19]:

	carrier	tailnum	origin	dest
0	AS	N508AS	PDX	ANC
1	US	N195UW	SEA	CLT
2	UA	N37422	PDX	IAH
3	US	N547UW	PDX	CLT
4	AS	N762AS	SEA	ANC

In [20]:

cat_df_flights_onehot = cat_df_flights.copy()
cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot, columns=['carrier'], prefix = ['carrier'])

cat_df_flights_onehot.head()

Out[20]:

	tailnum	origin	dest	carrier_AS	carrier_UA	carrier_US
0	N508AS	PDX	ANC	1	0	0
1	N195UW	SEA	CLT	0	0	1
2	N37422	PDX	IAH	0	1	0
3	N547UW	PDX	CLT	0	0	1
4	N762AS	SEA	ANC	1	0	0

In [21]:

cat_df_flights_onehot_sklearn = cat_df_flights.copy()

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb_results = lb.fit_transform(cat_df_flights_onehot_sklearn['carrier'])
lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)

lb_results_df.head()

Out[21]:

	AS	UA	US
0	1	0	0
1	0	0	1
2	0	1	0
3	0	0	1
4	1	0	0

In [22]:

result_df = pd.concat([cat_df_flights_onehot_sklearn, lb_results_df], axis=1)

result_df.head()

Out[22]:

	carrier	tailnum	origin	dest	AS	UA	US
0	AS	N508AS	PDX	ANC	1	0	0
1	US	N195UW	SEA	CLT	0	0	1
2	UA	N37422	PDX	IAH	0	1	0
3	US	N547UW	PDX	CLT	0	0	1
4	AS	N762AS	SEA	ANC	1	0	0

4. Binary Encoding¶

In [23]:

cat_df_flights_ce = cat_df_flights.copy()

import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['carrier'])
df_binary = encoder.fit_transform(cat_df_flights_ce)

df_binary.head()

Out[23]:

	carrier_3	carrier_4	tailnum	origin	dest
0	0	1	N508AS	PDX	ANC
1	1	0	N195UW	SEA	CLT
2	1	1	N37422	PDX	IAH
3	1	0	N547UW	PDX	CLT
4	0	1	N762AS	SEA	ANC

In [24]:

df_binary[df_binary['carrier_0']==1]

Out[24]:

	carrier_0	carrier_1	carrier_2	carrier_3	carrier_4	tailnum	origin	dest

5. Backward Difference Encoding¶

In [25]:

encoder = ce.BackwardDifferenceEncoder(cols=['carrier'])
df_bd = encoder.fit_transform(cat_df_flights_ce)

df_bd.head()

Out[25]:

	col_carrier_0	col_carrier_1	col_carrier_2	col_carrier_3	col_carrier_4	col_carrier_5	col_carrier_6	col_carrier_7	col_carrier_8	col_carrier_9	col_carrier_10	col_tailnum	col_origin	col_dest
0	1.0	-0.909091	-0.818182	-0.727273	-0.636364	-0.545455	-0.454545	-0.363636	-0.272727	-0.181818	-0.090909	N508AS	PDX	ANC
1	1.0	0.090909	-0.818182	-0.727273	-0.636364	-0.545455	-0.454545	-0.363636	-0.272727	-0.181818	-0.090909	N195UW	SEA	CLT
2	1.0	0.090909	0.181818	-0.727273	-0.636364	-0.545455	-0.454545	-0.363636	-0.272727	-0.181818	-0.090909	N37422	PDX	IAH
3	1.0	0.090909	-0.818182	-0.727273	-0.636364	-0.545455	-0.454545	-0.363636	-0.272727	-0.181818	-0.090909	N547UW	PDX	CLT
4	1.0	-0.909091	-0.818182	-0.727273	-0.636364	-0.545455	-0.454545	-0.363636	-0.272727	-0.181818	-0.090909	N762AS	SEA	ANC

In [26]:

np.unique(df_bd['col_carrier_1'])

Out[26]:

array([-0.90909091,  0.09090909])

6. Polynomial Encoding¶

In [27]:

encoder = ce.PolynomialEncoder(cols=['carrier'])
df_bd = encoder.fit_transform(cat_df_flights_ce)

df_bd.head()

Out[27]:

	col_carrier_0	col_carrier_1	col_carrier_2	col_carrier_3	col_carrier_4	col_carrier_5	col_carrier_6	col_carrier_7	col_carrier_8	col_carrier_9	col_carrier_10	col_tailnum	col_origin	col_dest
0	1.0	-0.476731	0.512092	-0.458029	0.354787	-0.240192	0.141610	-0.071707	0.030334	-0.010141	0.002326	N508AS	PDX	ANC
1	1.0	-0.381385	0.204837	0.091606	-0.354787	0.480384	-0.453153	0.329853	-0.188069	0.081127	-0.023265	N195UW	SEA	CLT
2	1.0	-0.286039	-0.034139	0.335888	-0.354787	0.080064	0.273780	-0.473267	0.442872	-0.273805	0.104692	N37422	PDX	IAH
3	1.0	-0.381385	0.204837	0.091606	-0.354787	0.480384	-0.453153	0.329853	-0.188069	0.081127	-0.023265	N547UW	PDX	CLT
4	1.0	-0.476731	0.512092	-0.458029	0.354787	-0.240192	0.141610	-0.071707	0.030334	-0.010141	0.002326	N762AS	SEA	ANC

In [28]:

np.unique(df_bd['col_carrier_1'])

Out[28]:

array([-4.76731295e-01, -3.81385036e-01, -2.86038777e-01, -1.90692518e-01,
       -9.53462589e-02, -1.16688970e-17,  9.53462589e-02,  1.90692518e-01,
        2.86038777e-01,  3.81385036e-01,  4.76731295e-01])

7. Miscellaneous Features¶

In [29]:

dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})
dummy_df_age['start'], dummy_df_age['end'] = zip(*dummy_df_age['age'].map(lambda x: x.split('-')))

dummy_df_age.head()

Out[29]:

	age	start	end
0	0-20	0	20
1	20-40	20	40
2	40-60	40	60
3	60-80	60	80

In [30]:

dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})

def split_mean(x):
    split_list = x.split('-')
    mean = (float(split_list[0])+float(split_list[1]))/2
    return mean

dummy_df_age['age_mean'] = dummy_df_age['age'].apply(lambda x: split_mean(x))

dummy_df_age.head()

Out[30]:

	age	age_mean
0	0-20	10.0
1	20-40	30.0
2	40-60	50.0
3	60-80	70.0

Dealing with Categorical Features in Big Data with Spark¶

The first step in Spark programming is to create a SparkContext. SparkContext is required when you want to execute operations in a cluster. SparkContext tells Spark how and where to access a cluster. You'll start by importing SparkContext.
To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext.

1st way¶

In [ ]:

#import findspark

#findspark.init()

#import pyspark

#confspark = pyspark.SparkConf().setMaster("local[*]").set("spark.cores.max", "4").set("spark.executor.memory", "2G").setAppName("--test--")

#sc = pyspark.SparkContext(conf=confspark)

#sc._conf.getAll()

#from pyspark.sql import SparkSession 

#spark = SparkSession(sc) 

#sc.stop()

2nd way¶

In [31]:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession 

#confspark = pyspark.SparkConf().setMaster("local[4]").set("spark.cores.max", "4").set("spark.executor.memory", "2G").setAppName("--test--")
#spark = SparkSession.builder.config(conf=confspark).getOrCreate()

spark = SparkSession.builder.master("local[*]").appName("--test--").config("spark.some.config.option", "some-value").getOrCreate()

In [32]:

spark.version

Out[32]:

'2.3.1'

In [33]:

spark.catalog.listTables()

Out[33]:

[]

In [34]:

spark_flights = spark.read.format("csv").option('header',True).load('data/flights.csv',inferSchema=True)
spark_flights.show(3)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|    1|  1|       1|       96|     235|       70|     AS| N508AS|   145|   PDX| ANC|     194|    1542|   0|     1|
|2014|    1|  1|       4|       -6|     738|      -23|     US| N195UW|  1830|   SEA| CLT|     252|    2279|   0|     4|
|2014|    1|  1|       8|       13|     548|       -4|     UA| N37422|  1609|   PDX| IAH|     201|    1825|   0|     8|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
only showing top 3 rows

In [35]:

spark_flights.printSchema()

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)

In [36]:

spark.catalog.listTables()

Out[36]:

[]

In [37]:

spark_flights.createOrReplaceTempView("flights_temp")

In [38]:

spark.catalog.listTables()

Out[38]:

[Table(name='flights_temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [39]:

carrier_df = spark_flights.select("carrier")
carrier_df.show(5)

+-------+
|carrier|
+-------+
|     AS|
|     US|
|     UA|
|     US|
|     AS|
+-------+
only showing top 5 rows

StringIndexer¶

In [40]:

from pyspark.ml.feature import StringIndexer

carr_indexer = StringIndexer(inputCol="carrier",outputCol="carrier_index")
carr_indexed = carr_indexer.fit(carrier_df).transform(carrier_df)

carr_indexed.show(7)

+-------+-------------+
|carrier|carrier_index|
+-------+-------------+
|     AS|          0.0|
|     US|          6.0|
|     UA|          4.0|
|     US|          6.0|
|     AS|          0.0|
|     DL|          3.0|
|     UA|          4.0|
+-------+-------------+
only showing top 7 rows

OneHotEncoder¶

In [41]:

carrier_df_onehot = spark_flights.select("carrier")

from pyspark.ml.feature import OneHotEncoder, StringIndexer

stringIndexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
model = stringIndexer.fit(carrier_df_onehot)
indexed = model.transform(carrier_df_onehot)

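# In Spark 2.3 (the version used here) OneHotEncoder is a Transformer, so transform() can be called directly.
# From Spark 3.0 on it is an Estimator and must be fit first: encoder.fit(indexed).transform(indexed)
encoder = OneHotEncoder(dropLast=False, inputCol="carrier_index", outputCol="carrier_vec")
encoded = encoder.transform(indexed)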

encoded.show(7)

+-------+-------------+--------------+
|carrier|carrier_index|   carrier_vec|
+-------+-------------+--------------+
|     AS|          0.0|(11,[0],[1.0])|
|     US|          6.0|(11,[6],[1.0])|
|     UA|          4.0|(11,[4],[1.0])|
|     US|          6.0|(11,[6],[1.0])|
|     AS|          0.0|(11,[0],[1.0])|
|     DL|          3.0|(11,[3],[1.0])|
|     UA|          4.0|(11,[4],[1.0])|
+-------+-------------+--------------+
only showing top 7 rows

Example¶

In [42]:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df1 = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c"),
    (6, "a"),
    (7, "b"),
    (8, "d"),
    (9, "d")
], ["id", "category"])

df2 = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

df = df2

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(dropLast=True, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|       a|          0.0|(2,[0],[1.0])|
|  1|       b|          2.0|    (2,[],[])|
|  2|       c|          1.0|(2,[1],[1.0])|
|  3|       a|          0.0|(2,[0],[1.0])|
|  4|       a|          0.0|(2,[0],[1.0])|
|  5|       c|          1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+

VectorIndexer¶

In [43]:

from pyspark.ml.feature import VectorIndexer

data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)
indexerModel = indexer.fit(data)

categoricalFeatures = indexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures), ", ".join(str(k) for k in categoricalFeatures.keys())))

# Create new column "indexed" with categorical values transformed to indices
indexedData = indexerModel.transform(data)
indexedData.show()

Chose 351 categorical features: 645, 69, 365, 138, 101, 479, 333, 249, 0, 555, 666, 88, 170, 115, 276, 308, 5, 449, 120, 247, 614, 677, 202, 10, 56, 533, 142, 500, 340, 670, 174, 42, 417, 24, 37, 25, 257, 389, 52, 14, 504, 110, 587, 619, 196, 559, 638, 20, 421, 46, 93, 284, 228, 448, 57, 78, 29, 475, 164, 591, 646, 253, 106, 121, 84, 480, 147, 280, 61, 221, 396, 89, 133, 116, 1, 507, 312, 74, 307, 452, 6, 248, 60, 117, 678, 529, 85, 201, 220, 366, 534, 102, 334, 28, 38, 561, 392, 70, 424, 192, 21, 137, 165, 33, 92, 229, 252, 197, 361, 65, 97, 665, 583, 285, 224, 650, 615, 9, 53, 169, 593, 141, 610, 420, 109, 256, 225, 339, 77, 193, 669, 476, 642, 637, 590, 679, 96, 393, 647, 173, 13, 41, 503, 134, 73, 105, 2, 508, 311, 558, 674, 530, 586, 618, 166, 32, 34, 148, 45, 161, 279, 64, 689, 17, 149, 584, 562, 176, 423, 191, 22, 44, 59, 118, 281, 27, 641, 71, 391, 12, 445, 54, 313, 611, 144, 49, 335, 86, 672, 172, 113, 681, 219, 419, 81, 230, 362, 451, 76, 7, 39, 649, 98, 616, 477, 367, 535, 103, 140, 621, 91, 66, 251, 668, 198, 108, 278, 223, 394, 306, 135, 563, 226, 3, 505, 80, 167, 35, 473, 675, 589, 162, 531, 680, 255, 648, 112, 617, 194, 145, 48, 557, 690, 63, 640, 18, 282, 95, 310, 50, 67, 199, 673, 16, 585, 502, 338, 643, 31, 336, 613, 11, 72, 175, 446, 612, 143, 43, 250, 231, 450, 99, 363, 556, 87, 203, 671, 688, 104, 368, 588, 40, 304, 26, 258, 390, 55, 114, 171, 139, 418, 23, 8, 75, 119, 58, 667, 478, 536, 82, 620, 447, 36, 168, 146, 30, 51, 190, 19, 422, 564, 305, 107, 4, 136, 506, 79, 195, 474, 664, 532, 94, 283, 395, 332, 528, 644, 47, 15, 163, 200, 68, 62, 277, 691, 501, 90, 111, 254, 227, 337, 122, 83, 309, 560, 639, 676, 222, 592, 364, 100
+-----+--------------------+--------------------+
|label|            features|             indexed|
+-----+--------------------+--------------------+
|  0.0|(692,[127,128,129...|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|(692,[127,128,129...|
|  1.0|(692,[154,155,156...|(692,[154,155,156...|
|  0.0|(692,[153,154,155...|(692,[153,154,155...|
|  0.0|(692,[151,152,153...|(692,[151,152,153...|
|  1.0|(692,[129,130,131...|(692,[129,130,131...|
|  0.0|(692,[154,155,156...|(692,[154,155,156...|
|  1.0|(692,[150,151,152...|(692,[150,151,152...|
|  0.0|(692,[124,125,126...|(692,[124,125,126...|
|  0.0|(692,[152,153,154...|(692,[152,153,154...|
|  1.0|(692,[97,98,99,12...|(692,[97,98,99,12...|
|  1.0|(692,[124,125,126...|(692,[124,125,126...|
+-----+--------------------+--------------------+
only showing top 20 rows

https://www.datacamp.com/community/tutorials/categorical-data

http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html
