Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset is about information related to users, then you will typically find features like country, gender, age group, etc. Alternatively, if the data you're working with is related to products, you will find features like product type, manufacturer, seller and so on.
These are all categorical features in your dataset. These features are typically stored as text values which represent various traits of the observations. For example, gender is described as Male (M) or Female (F), and product type could be described as electronics, apparel, food, etc.
Note that features of this type, where the categories are only labeled without any order of precedence, are called nominal features. Features that have some order associated with them are called ordinal features. For example, economic status with the three categories low, medium and high is ordinal, because the categories can be ranked.
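Because ordinal categories carry a ranking, the mapping to integers should follow that ranking rather than, say, alphabetical order. A minimal sketch, assuming a hypothetical economic status column with the three levels from the example above (the name `STATUS_ORDER` and the sample values are made up for illustration):

```python
# Hypothetical ordinal feature: the order low < medium < high is part of the data.
STATUS_ORDER = ["low", "medium", "high"]

# Map each level to its rank so that comparisons between codes are meaningful.
status_to_rank = {level: rank for rank, level in enumerate(STATUS_ORDER)}

economic_status = ["medium", "low", "high", "low"]
encoded = [status_to_rank[value] for value in economic_status]
print(encoded)  # [1, 0, 2, 0]

# Alphabetical ordering would give high < low < medium, which breaks the ranking.
alphabetical = {level: code for code, level in enumerate(sorted(STATUS_ORDER))}
print(alphabetical)  # {'high': 0, 'low': 1, 'medium': 2}
```

For a nominal feature no such ranking exists, so any integer assignment is arbitrary.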
There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or a date/time.
Regardless of what the value is used for, the challenge is determining how to use this data in the analysis because of the following constraints:
Categorical features may have a very large number of levels, known as high cardinality (for example, cities or URLs), where most of the levels appear in a relatively small number of instances.
Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must first be transformed into numbers before you can apply the learning algorithm to them.
While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.
For the machine, categorical data doesn't contain the same context or information that humans can easily associate and understand. For example, when looking at a feature called City with three cities, New York, New Jersey and New Delhi, humans can infer that New York is closely related to New Jersey as they are from the same country, while New York and New Delhi are very different. But for the model, New York, New Jersey and New Delhi are just three different levels (possible values) of the same feature City. If you don't specify the additional contextual information, the model cannot tell which levels are similar and which are very different.
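For the high-cardinality constraint listed above, one common approach is to keep the frequent levels and collapse the rare ones into a single catch-all level before encoding. The sketch below is only an illustration: the helper `group_rare_levels`, the `min_count` threshold and the sample city list are assumptions, not part of this notebook's data.

```python
from collections import Counter

def group_rare_levels(values, min_count=2, other_label="Other"):
    """Replace levels seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

cities = ["New York", "New Delhi", "New York", "Oslo", "New Delhi", "Lima"]
grouped = group_rare_levels(cities, min_count=2)
print(grouped)
# ['New York', 'New Delhi', 'New York', 'Other', 'New Delhi', 'Other']
print(len(set(cities)), "->", len(set(grouped)))  # 4 -> 3
```

The threshold is a modelling choice; a value that suits one dataset may throw away useful levels in another.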
You are therefore faced with the challenge of turning these text values into numerical values for further processing, and of unmasking the interesting information these features might hide. Typically, a standard feature engineering workflow transforms the categorical values into numeric labels and then applies some encoding scheme to those values.
Existing Encoding Methods (all modules for which code is available, see: http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html)
The techniques that you'll cover are the following:
1. Replace Values
import pandas as pd
import numpy as np
import copy
%matplotlib inline
df_flights = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')
df_flights.head()
year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014 | 1 | 1 | 1.0 | 96.0 | 235.0 | 70.0 | AS | N508AS | 145 | PDX | ANC | 194.0 | 1542 | 0.0 | 1.0 |
1 | 2014 | 1 | 1 | 4.0 | -6.0 | 738.0 | -23.0 | US | N195UW | 1830 | SEA | CLT | 252.0 | 2279 | 0.0 | 4.0 |
2 | 2014 | 1 | 1 | 8.0 | 13.0 | 548.0 | -4.0 | UA | N37422 | 1609 | PDX | IAH | 201.0 | 1825 | 0.0 | 8.0 |
3 | 2014 | 1 | 1 | 28.0 | -2.0 | 800.0 | -23.0 | US | N547UW | 466 | PDX | CLT | 251.0 | 2282 | 0.0 | 28.0 |
4 | 2014 | 1 | 1 | 34.0 | 44.0 | 325.0 | 43.0 | AS | N762AS | 121 | SEA | ANC | 201.0 | 1448 | 0.0 | 34.0 |
cat_df_flights = df_flights.select_dtypes(include=['object']).copy()
cat_df_flights.head()
carrier | tailnum | origin | dest | |
---|---|---|---|---|
0 | AS | N508AS | PDX | ANC |
1 | US | N195UW | SEA | CLT |
2 | UA | N37422 | PDX | IAH |
3 | US | N547UW | PDX | CLT |
4 | AS | N762AS | SEA | ANC |
print(cat_df_flights.isnull().values.sum())
248
cat_df_flights = cat_df_flights.fillna(cat_df_flights['tailnum'].value_counts().index[0])
print(cat_df_flights.isnull().values.sum())
0
replace_map = {'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7 , 'UA': 8 , 'US': 9,'VX': 10,'WN': 11}}
labels = cat_df_flights['carrier'].astype('category').cat.categories.tolist()
replace_map_comp = {'carrier': {label: code for code, label in enumerate(labels, start=1)}}
print(replace_map_comp)
{'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}
cat_df_flights_replace = cat_df_flights.copy()
cat_df_flights_replace.replace(replace_map_comp, inplace=True)
print(cat_df_flights_replace.head())
   carrier tailnum origin dest
0        2  N508AS    PDX  ANC
1        9  N195UW    SEA  CLT
2        8  N37422    PDX  IAH
3        9  N547UW    PDX  CLT
4        2  N762AS    SEA  ANC
print(cat_df_flights_replace['carrier'].dtypes)
int64
cat_df_flights_lc = cat_df_flights.copy()
cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].astype('category')
cat_df_flights_lc['origin'] = cat_df_flights_lc['origin'].astype('category')
print(cat_df_flights_lc.dtypes)
carrier    category
tailnum      object
origin     category
dest         object
dtype: object
import time
%timeit cat_df_flights.groupby(['origin','carrier']).count() #DataFrame with object dtype columns
31.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cat_df_flights_lc.groupby(['origin','carrier']).count() #DataFrame with category dtype columns
21.6 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2. Label Encoding
cat_df_flights_lc.head()
carrier | tailnum | origin | dest | |
---|---|---|---|---|
0 | AS | N508AS | PDX | ANC |
1 | US | N195UW | SEA | CLT |
2 | UA | N37422 | PDX | IAH |
3 | US | N547UW | PDX | CLT |
4 | AS | N762AS | SEA | ANC |
cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes
cat_df_flights_lc.head() #alphabetically labeled from 0 to 10
carrier | tailnum | origin | dest | |
---|---|---|---|---|
0 | 1 | N508AS | PDX | ANC |
1 | 8 | N195UW | SEA | CLT |
2 | 7 | N37422 | PDX | IAH |
3 | 8 | N547UW | PDX | CLT |
4 | 1 | N762AS | SEA | ANC |
cat_df_flights_specific = cat_df_flights.copy()
cat_df_flights_specific['US_code'] = np.where(cat_df_flights_specific['carrier'].str.contains('US'), 1, 0)
cat_df_flights_specific.head()
carrier | tailnum | origin | dest | US_code | |
---|---|---|---|---|---|
0 | AS | N508AS | PDX | ANC | 0 |
1 | US | N195UW | SEA | CLT | 1 |
2 | UA | N37422 | PDX | IAH | 0 |
3 | US | N547UW | PDX | CLT | 1 |
4 | AS | N762AS | SEA | ANC | 0 |
cat_df_flights_sklearn = cat_df_flights.copy()
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()
cat_df_flights_sklearn['carrier_code'] = lb_make.fit_transform(cat_df_flights['carrier'])
cat_df_flights_sklearn.head() #Results in appending a new column to df
carrier | tailnum | origin | dest | carrier_code | |
---|---|---|---|---|---|
0 | AS | N508AS | PDX | ANC | 1 |
1 | US | N195UW | SEA | CLT | 8 |
2 | UA | N37422 | PDX | IAH | 7 |
3 | US | N547UW | PDX | CLT | 8 |
4 | AS | N762AS | SEA | ANC | 1 |
Label encoding is intuitive and straightforward and may work well with your learning algorithm, but it has the disadvantage that the algorithm can misinterpret the numerical values as having an order or a magnitude. Should the carrier US (encoded as 8) be given eight times more weight than the carrier AS (encoded as 1)?
To avoid this problem, another popular way to encode the categories is one-hot encoding.
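To make the concern concrete, the sketch below compares how far apart carriers appear under the two schemes. It is a plain-Python illustration, not the pandas or scikit-learn code used in this notebook; the helper names `one_hot` and `euclidean` are made up, and the label codes are assumed to follow the alphabetical order shown above.

```python
import math

carriers = ["AA", "AS", "B6", "DL", "F9", "HA", "OO", "UA", "US", "VX", "WN"]

# Label encoding: alphabetical position, as produced above (AS -> 1, US -> 8).
label_code = {c: i for i, c in enumerate(carriers)}

def one_hot(carrier):
    """Return a vector with a single 1 at the carrier's position."""
    return [1 if c == carrier else 0 for c in carriers]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# With label codes, some carrier pairs look much closer than others.
print(abs(label_code["US"] - label_code["AS"]))  # 7
print(abs(label_code["US"] - label_code["UA"]))  # 1

# With one-hot vectors, every pair of distinct carriers is equally far apart.
print(euclidean(one_hot("US"), one_hot("AS")))  # 1.4142135623730951
print(euclidean(one_hot("US"), one_hot("UA")))  # 1.4142135623730951
```

The price of this neutrality is one extra column per category, which becomes costly for high-cardinality features.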
3. One-Hot Encoding
cat_df_flights.head()
carrier | tailnum | origin | dest | |
---|---|---|---|---|
0 | AS | N508AS | PDX | ANC |
1 | US | N195UW | SEA | CLT |
2 | UA | N37422 | PDX | IAH |
3 | US | N547UW | PDX | CLT |
4 | AS | N762AS | SEA | ANC |
cat_df_flights_onehot = cat_df_flights.copy()
cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot, columns=['carrier'], prefix = ['carrier'])
cat_df_flights_onehot.head()
tailnum | origin | dest | carrier_AA | carrier_AS | carrier_B6 | carrier_DL | carrier_F9 | carrier_HA | carrier_OO | carrier_UA | carrier_US | carrier_VX | carrier_WN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | N508AS | PDX | ANC | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | N195UW | SEA | CLT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | N37422 | PDX | IAH | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | N547UW | PDX | CLT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | N762AS | SEA | ANC | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
cat_df_flights_onehot_sklearn = cat_df_flights.copy()
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb_results = lb.fit_transform(cat_df_flights_onehot_sklearn['carrier'])
lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)
lb_results_df.head()
AA | AS | B6 | DL | F9 | HA | OO | UA | US | VX | WN | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
result_df = pd.concat([cat_df_flights_onehot_sklearn, lb_results_df], axis=1)
result_df.head()
carrier | tailnum | origin | dest | AA | AS | B6 | DL | F9 | HA | OO | UA | US | VX | WN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | N508AS | PDX | ANC | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | US | N195UW | SEA | CLT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | UA | N37422 | PDX | IAH | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | US | N547UW | PDX | CLT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | AS | N762AS | SEA | ANC | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4. Binary Encoding
cat_df_flights_ce = cat_df_flights.copy()
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['carrier'])
df_binary = encoder.fit_transform(cat_df_flights_ce)
df_binary.head()
carrier_0 | carrier_1 | carrier_2 | carrier_3 | carrier_4 | tailnum | origin | dest | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | N508AS | PDX | ANC |
1 | 0 | 0 | 0 | 1 | 0 | N195UW | SEA | CLT |
2 | 0 | 0 | 0 | 1 | 1 | N37422 | PDX | IAH |
3 | 0 | 0 | 0 | 1 | 0 | N547UW | PDX | CLT |
4 | 0 | 0 | 0 | 0 | 1 | N762AS | SEA | ANC |
df_binary[df_binary['carrier_0']==1]
carrier_0 | carrier_1 | carrier_2 | carrier_3 | carrier_4 | tailnum | origin | dest |
---|
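Binary encoding first assigns each category an integer and then writes that integer in base 2, one column per bit. In the output above, AS, US and UA map to 1, 2 and 3, which is consistent with numbering the categories by first appearance starting at 1. The sketch below mimics this in plain Python; it approximates the idea and is not the `category_encoders` implementation. The function name and the `n_bits` parameter are hypothetical, and the number of columns the library emits may differ between versions (the output above has five, although four bits are enough for 11 carriers).

```python
def binary_encode(values, n_bits=None):
    """Encode categories as fixed-width binary digits of an ordinal code.

    Ordinal codes start at 1 and follow order of first appearance
    (an assumption made to match the output shown above).
    """
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    if n_bits is None:
        n_bits = max(codes.values()).bit_length()
    rows = [[int(bit) for bit in format(codes[v], f"0{n_bits}b")] for v in values]
    return rows, codes

carriers = ["AS", "US", "UA", "US", "AS"]
rows, codes = binary_encode(carriers, n_bits=5)
print(codes)  # {'AS': 1, 'US': 2, 'UA': 3}
for carrier, row in zip(carriers, rows):
    print(carrier, row)
# AS [0, 0, 0, 0, 1]
# US [0, 0, 0, 1, 0]
# UA [0, 0, 0, 1, 1]
# US [0, 0, 0, 1, 0]
# AS [0, 0, 0, 0, 1]
```

With 11 carriers the largest code is 11 (`01011` in five bits), so the leading column never becomes 1, which is why filtering on `carrier_0 == 1` returns an empty table.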
5. Backward Difference Encoding
encoder = ce.BackwardDifferenceEncoder(cols=['carrier'])
df_bd = encoder.fit_transform(cat_df_flights_ce)
df_bd.head()
col_carrier_0 | col_carrier_1 | col_carrier_2 | col_carrier_3 | col_carrier_4 | col_carrier_5 | col_carrier_6 | col_carrier_7 | col_carrier_8 | col_carrier_9 | col_carrier_10 | col_tailnum | col_origin | col_dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | -0.909091 | -0.818182 | -0.727273 | -0.636364 | -0.545455 | -0.454545 | -0.363636 | -0.272727 | -0.181818 | -0.090909 | N508AS | PDX | ANC |
1 | 1.0 | 0.090909 | -0.818182 | -0.727273 | -0.636364 | -0.545455 | -0.454545 | -0.363636 | -0.272727 | -0.181818 | -0.090909 | N195UW | SEA | CLT |
2 | 1.0 | 0.090909 | 0.181818 | -0.727273 | -0.636364 | -0.545455 | -0.454545 | -0.363636 | -0.272727 | -0.181818 | -0.090909 | N37422 | PDX | IAH |
3 | 1.0 | 0.090909 | -0.818182 | -0.727273 | -0.636364 | -0.545455 | -0.454545 | -0.363636 | -0.272727 | -0.181818 | -0.090909 | N547UW | PDX | CLT |
4 | 1.0 | -0.909091 | -0.818182 | -0.727273 | -0.636364 | -0.545455 | -0.454545 | -0.363636 | -0.272727 | -0.181818 | -0.090909 | N762AS | SEA | ANC |
np.unique(df_bd['col_carrier_1'])
array([-0.90909091, 0.09090909])
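Backward difference coding compares each level with the level before it. For a feature with k levels, contrast column j takes the value -(k-j)/k for the first j levels and j/k for the remaining ones; with k = 11 this gives the two values -10/11 and 1/11 seen in `col_carrier_1`. The helper below is a hand-rolled sketch of that contrast matrix. The function name is hypothetical, the intercept column (`col_carrier_0`) is omitted, and levels are assumed to be ordered by first appearance as in the output above.

```python
def backward_difference_contrasts(k):
    """Return a k x (k-1) backward difference contrast matrix.

    Row `level` (0-based) and column j (1-based contrast) hold
    -(k - j) / k when the level is among the first j, else j / k.
    """
    return [
        [-(k - j) / k if level < j else j / k for j in range(1, k)]
        for level in range(k)
    ]

contrasts = backward_difference_contrasts(11)
print([round(x, 6) for x in contrasts[0][:3]])  # [-0.909091, -0.818182, -0.727273]
print([round(x, 6) for x in contrasts[1][:3]])  # [0.090909, -0.818182, -0.727273]
print([round(x, 6) for x in contrasts[2][:3]])  # [0.090909, 0.181818, -0.727273]
```

The three printed rows correspond to the AS, US and UA rows of the table above; each contrast column sums to zero over the levels.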
6. Polynomial Encoding
encoder = ce.PolynomialEncoder(cols=['carrier'])
df_bd = encoder.fit_transform(cat_df_flights_ce)
df_bd.head()
col_carrier_0 | col_carrier_1 | col_carrier_2 | col_carrier_3 | col_carrier_4 | col_carrier_5 | col_carrier_6 | col_carrier_7 | col_carrier_8 | col_carrier_9 | col_carrier_10 | col_tailnum | col_origin | col_dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | -0.476731 | 0.512092 | -0.458029 | 0.354787 | -0.240192 | 0.141610 | -0.071707 | 0.030334 | -0.010141 | 0.002326 | N508AS | PDX | ANC |
1 | 1.0 | -0.381385 | 0.204837 | 0.091606 | -0.354787 | 0.480384 | -0.453153 | 0.329853 | -0.188069 | 0.081127 | -0.023265 | N195UW | SEA | CLT |
2 | 1.0 | -0.286039 | -0.034139 | 0.335888 | -0.354787 | 0.080064 | 0.273780 | -0.473267 | 0.442872 | -0.273805 | 0.104692 | N37422 | PDX | IAH |
3 | 1.0 | -0.381385 | 0.204837 | 0.091606 | -0.354787 | 0.480384 | -0.453153 | 0.329853 | -0.188069 | 0.081127 | -0.023265 | N547UW | PDX | CLT |
4 | 1.0 | -0.476731 | 0.512092 | -0.458029 | 0.354787 | -0.240192 | 0.141610 | -0.071707 | 0.030334 | -0.010141 | 0.002326 | N762AS | SEA | ANC |
np.unique(df_bd['col_carrier_1'])
array([-4.76731295e-01, -3.81385036e-01, -2.86038777e-01, -1.90692518e-01, -9.53462589e-02, -1.16688970e-17, 9.53462589e-02, 1.90692518e-01, 2.86038777e-01, 3.81385036e-01, 4.76731295e-01])
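Polynomial coding replaces the level index with orthogonal polynomial terms (linear, quadratic, cubic, ...), assuming the levels are equally spaced. The sketch below builds the first three contrast columns with Gram-Schmidt orthogonalisation in plain Python. It is an illustration under that assumption, not the library's code; the function name is hypothetical, and only low degrees are computed because this naive approach loses precision for high ones.

```python
import math

def polynomial_contrasts(k, max_degree):
    """Orthonormal polynomial contrasts for k equally spaced levels.

    Built with Gram-Schmidt on the powers of the centred level index;
    returns one list of k values per degree 1..max_degree.
    """
    mean = (k - 1) / 2
    x = [i - mean for i in range(k)]
    basis = [[1 / math.sqrt(k)] * k]  # normalised constant column
    for degree in range(1, max_degree + 1):
        col = [xi ** degree for xi in x]
        for b in basis:  # remove components along lower-degree columns
            dot = sum(c * bi for c, bi in zip(col, b))
            col = [c - dot * bi for c, bi in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    return basis[1:]

linear, quadratic, cubic = polynomial_contrasts(11, 3)
print(round(linear[0], 6), round(quadratic[0], 6), round(cubic[0], 6))
# -0.476731 0.512092 -0.458029
```

The printed values match `col_carrier_1` to `col_carrier_3` for the first level (AS) in the table above, and the eleven evenly spaced values returned by `np.unique` are the linear column.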
7. Miscellaneous Features
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})
dummy_df_age['start'], dummy_df_age['end'] = zip(*dummy_df_age['age'].map(lambda x: x.split('-')))
dummy_df_age.head()
age | start | end | |
---|---|---|---|
0 | 0-20 | 0 | 20 |
1 | 20-40 | 20 | 40 |
2 | 40-60 | 40 | 60 |
3 | 60-80 | 60 | 80 |
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})
def split_mean(x):
    """Return the midpoint of an age range given as 'start-end'."""
    start, end = x.split('-')
    return (float(start) + float(end)) / 2

dummy_df_age['age_mean'] = dummy_df_age['age'].apply(split_mean)
dummy_df_age.head()
age | age_mean | |
---|---|---|
0 | 0-20 | 10.0 |
1 | 20-40 | 30.0 |
2 | 40-60 | 50.0 |
3 | 60-80 | 70.0 |
The first step in Spark programming is to create a SparkContext. SparkContext is required when you want to execute operations in a cluster. SparkContext tells Spark how and where to access a cluster. You'll start by importing SparkContext.
To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext.
1st way
#import findspark
#findspark.init()
#import pyspark
#confspark = pyspark.SparkConf().setMaster("local[*]").set("spark.cores.max", "4").set("spark.executor.memory", "2G").setAppName("--test--")
#sc = pyspark.SparkContext(conf=confspark)
#sc._conf.getAll()
#from pyspark.sql import SparkSession
#spark = SparkSession(sc)
#sc.stop()
2nd way
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
#confspark = pyspark.SparkConf().setMaster("local[4]").set("spark.cores.max", "4").set("spark.executor.memory", "2G").setAppName("--test--")
#spark = SparkSession.builder.config(conf=confspark).getOrCreate()
spark = SparkSession.builder.master("local[*]").appName("--test--").config("spark.some.config.option", "some-value").getOrCreate()
spark.version
'2.3.1'
spark.catalog.listTables()
[]
spark_flights = spark.read.format("csv").option('header',True).load('data/flights.csv',inferSchema=True)
spark_flights.show(3)
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|    1|  1|       1|       96|     235|       70|     AS| N508AS|   145|   PDX| ANC|     194|    1542|   0|     1|
|2014|    1|  1|       4|       -6|     738|      -23|     US| N195UW|  1830|   SEA| CLT|     252|    2279|   0|     4|
|2014|    1|  1|       8|       13|     548|       -4|     UA| N37422|  1609|   PDX| IAH|     201|    1825|   0|     8|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
only showing top 3 rows
spark_flights.printSchema()
root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
spark.catalog.listTables()
[]
spark_flights.createOrReplaceTempView("flights_temp")
spark.catalog.listTables()
[Table(name='flights_temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
carrier_df = spark_flights.select("carrier")
carrier_df.show(5)
+-------+
|carrier|
+-------+
|     AS|
|     US|
|     UA|
|     US|
|     AS|
+-------+
only showing top 5 rows
StringIndexer
from pyspark.ml.feature import StringIndexer
carr_indexer = StringIndexer(inputCol="carrier",outputCol="carrier_index")
carr_indexed = carr_indexer.fit(carrier_df).transform(carrier_df)
carr_indexed.show(7)
+-------+-------------+
|carrier|carrier_index|
+-------+-------------+
|     AS|          0.0|
|     US|          6.0|
|     UA|          4.0|
|     US|          6.0|
|     AS|          0.0|
|     DL|          3.0|
|     UA|          4.0|
+-------+-------------+
only showing top 7 rows
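By default `StringIndexer` orders labels by descending frequency, so the most common carrier receives index 0.0 rather than the alphabetically first one. A plain-Python sketch of that ordering follows; the sample list and the function name are made up, and the way Spark breaks ties between equally frequent labels is not modelled here.

```python
from collections import Counter

def frequency_index(values):
    """Map each label to its rank by descending frequency (most frequent -> 0.0)."""
    ranked = [label for label, _ in Counter(values).most_common()]
    return {label: float(rank) for rank, label in enumerate(ranked)}

sample = ["AS", "US", "AS", "UA", "AS", "US"]
index = frequency_index(sample)
print(index)  # {'AS': 0.0, 'US': 1.0, 'UA': 2.0}
print([index[c] for c in sample])  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

Because the indices depend on the data the indexer was fitted on, the same label can receive a different index on a different dataset.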
OneHotEncoder
carrier_df_onehot = spark_flights.select("carrier")
from pyspark.ml.feature import OneHotEncoder, StringIndexer
stringIndexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
model = stringIndexer.fit(carrier_df_onehot)
indexed = model.transform(carrier_df_onehot)
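# Spark 2.x: OneHotEncoder is a Transformer, so transform() can be called directly.
# In Spark 3.x it is an Estimator; there, use encoder.fit(indexed).transform(indexed).
encoder = OneHotEncoder(dropLast=False, inputCol="carrier_index", outputCol="carrier_vec")
encoded = encoder.transform(indexed)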
encoded.show(7)
+-------+-------------+--------------+
|carrier|carrier_index|   carrier_vec|
+-------+-------------+--------------+
|     AS|          0.0|(11,[0],[1.0])|
|     US|          6.0|(11,[6],[1.0])|
|     UA|          4.0|(11,[4],[1.0])|
|     US|          6.0|(11,[6],[1.0])|
|     AS|          0.0|(11,[0],[1.0])|
|     DL|          3.0|(11,[3],[1.0])|
|     UA|          4.0|(11,[4],[1.0])|
+-------+-------------+--------------+
only showing top 7 rows
Example
from pyspark.ml.feature import OneHotEncoder, StringIndexer
df1 = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c"),
(6, "a"),
(7, "b"),
(8, "d"),
(9, "d")
], ["id", "category"])
df2 = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
df = df2
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(dropLast=True, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|       a|          0.0|(2,[0],[1.0])|
|  1|       b|          2.0|    (2,[],[])|
|  2|       c|          1.0|(2,[1],[1.0])|
|  3|       a|          0.0|(2,[0],[1.0])|
|  4|       a|          0.0|(2,[0],[1.0])|
|  5|       c|          1.0|(2,[1],[1.0])|
+---+--------+-------------+-------------+
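In the output above, `b` is the least frequent label, so it receives the last index (2.0); with `dropLast=True` that last category is represented by the all-zeros vector `(2,[],[])`. Spark prints sparse vectors as `(size, [indices], [values])`. The sketch below reproduces that representation in plain Python; the function is hypothetical and only meant to illustrate the convention.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a one-hot vector in sparse form."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    return (size, [], [])  # the dropped last category: all zeros

category_index = {"a": 0, "c": 1, "b": 2}  # indices from the table above
for label in ["a", "b", "c"]:
    print(label, one_hot_sparse(category_index[label], num_categories=3))
# a (2, [0], [1.0])
# b (2, [], [])
# c (2, [1], [1.0])
```

Dropping the last category keeps the encoded columns from summing to one in every row, which matters for models that are sensitive to perfectly collinear inputs.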
VectorIndexer
from pyspark.ml.feature import VectorIndexer
data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=10)
indexerModel = indexer.fit(data)
categoricalFeatures = indexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures), ", ".join(str(k) for k in categoricalFeatures.keys())))
# Create new column "indexed" with categorical values transformed to indices
indexedData = indexerModel.transform(data)
indexedData.show()
Chose 351 categorical features: 645, 69, 365, 138, 101, 479, 333, 249, 0, 555, 666, 88, 170, 115, 276, 308, 5, 449, 120, 247, 614, 677, 202, 10, 56, 533, 142, 500, 340, 670, 174, 42, 417, 24, 37, 25, 257, 389, 52, 14, 504, 110, 587, 619, 196, 559, 638, 20, 421, 46, 93, 284, 228, 448, 57, 78, 29, 475, 164, 591, 646, 253, 106, 121, 84, 480, 147, 280, 61, 221, 396, 89, 133, 116, 1, 507, 312, 74, 307, 452, 6, 248, 60, 117, 678, 529, 85, 201, 220, 366, 534, 102, 334, 28, 38, 561, 392, 70, 424, 192, 21, 137, 165, 33, 92, 229, 252, 197, 361, 65, 97, 665, 583, 285, 224, 650, 615, 9, 53, 169, 593, 141, 610, 420, 109, 256, 225, 339, 77, 193, 669, 476, 642, 637, 590, 679, 96, 393, 647, 173, 13, 41, 503, 134, 73, 105, 2, 508, 311, 558, 674, 530, 586, 618, 166, 32, 34, 148, 45, 161, 279, 64, 689, 17, 149, 584, 562, 176, 423, 191, 22, 44, 59, 118, 281, 27, 641, 71, 391, 12, 445, 54, 313, 611, 144, 49, 335, 86, 672, 172, 113, 681, 219, 419, 81, 230, 362, 451, 76, 7, 39, 649, 98, 616, 477, 367, 535, 103, 140, 621, 91, 66, 251, 668, 198, 108, 278, 223, 394, 306, 135, 563, 226, 3, 505, 80, 167, 35, 473, 675, 589, 162, 531, 680, 255, 648, 112, 617, 194, 145, 48, 557, 690, 63, 640, 18, 282, 95, 310, 50, 67, 199, 673, 16, 585, 502, 338, 643, 31, 336, 613, 11, 72, 175, 446, 612, 143, 43, 250, 231, 450, 99, 363, 556, 87, 203, 671, 688, 104, 368, 588, 40, 304, 26, 258, 390, 55, 114, 171, 139, 418, 23, 8, 75, 119, 58, 667, 478, 536, 82, 620, 447, 36, 168, 146, 30, 51, 190, 19, 422, 564, 305, 107, 4, 136, 506, 79, 195, 474, 664, 532, 94, 283, 395, 332, 528, 644, 47, 15, 163, 200, 68, 62, 277, 691, 501, 90, 111, 254, 227, 337, 122, 83, 309, 560, 639, 676, 222, 592, 364, 100 +-----+--------------------+--------------------+ |label| features| indexed| +-----+--------------------+--------------------+ | 0.0|(692,[127,128,129...|(692,[127,128,129...| | 1.0|(692,[158,159,160...|(692,[158,159,160...| | 1.0|(692,[124,125,126...|(692,[124,125,126...| | 1.0|(692,[152,153,154...|(692,[152,153,154...| 
| 1.0|(692,[151,152,153...|(692,[151,152,153...| | 0.0|(692,[129,130,131...|(692,[129,130,131...| | 1.0|(692,[158,159,160...|(692,[158,159,160...| | 1.0|(692,[99,100,101,...|(692,[99,100,101,...| | 0.0|(692,[154,155,156...|(692,[154,155,156...| | 0.0|(692,[127,128,129...|(692,[127,128,129...| | 1.0|(692,[154,155,156...|(692,[154,155,156...| | 0.0|(692,[153,154,155...|(692,[153,154,155...| | 0.0|(692,[151,152,153...|(692,[151,152,153...| | 1.0|(692,[129,130,131...|(692,[129,130,131...| | 0.0|(692,[154,155,156...|(692,[154,155,156...| | 1.0|(692,[150,151,152...|(692,[150,151,152...| | 0.0|(692,[124,125,126...|(692,[124,125,126...| | 0.0|(692,[152,153,154...|(692,[152,153,154...| | 1.0|(692,[97,98,99,12...|(692,[97,98,99,12...| | 1.0|(692,[124,125,126...|(692,[124,125,126...| +-----+--------------------+--------------------+ only showing top 20 rows
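`VectorIndexer` treats a feature as categorical when its number of distinct values is at most `maxCategories` and re-maps those values to category indices; features with more distinct values are left as continuous. The following is a simplified sketch on dense rows. The toy data and the function name are assumptions, the real transformer works on Spark vector columns, and details such as how zero values are handled are not modelled.

```python
def choose_categorical(rows, max_categories):
    """Return {feature_index: {value: category_index}} for low-cardinality features."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            category_maps[j] = {value: idx for idx, value in enumerate(distinct)}
    return category_maps

rows = [
    [0.0, 10.5, 3.0],
    [1.0, 11.2, 3.0],
    [0.0, 12.9, 7.0],
    [1.0, 13.1, 7.0],
]
maps = choose_categorical(rows, max_categories=2)
print(maps)  # {0: {0.0: 0, 1.0: 1}, 2: {3.0: 0, 7.0: 1}}

# Re-map categorical features to their indices and leave continuous ones untouched.
indexed = [[maps[j][v] if j in maps else v for j, v in enumerate(row)] for row in rows]
print(indexed[2])  # [0, 12.9, 1]
```

In the toy data, features 0 and 2 have two distinct values each and are indexed, while feature 1 has four and stays continuous.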