In this exercise, you will use Spark MLlib feature transformers: Tokenizer, NGram, and PCA.
This exercise is part of the Scaling Machine Learning with Spark book, available on the O'Reilly platform and on Amazon.
# Create SparkSession from builder
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('Scalling_ml_with_spark-week_2') \
    .getOrCreate()
Extract n-grams using the Tokenizer, connecting data marshaling with feature engineering. For that, use the NGram functionality.
Check out this example:
# Split each sentence into a list of words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
# UDF that counts the tokens in a row (defined here, applied separately below)
countTokens = udf(lambda words: len(words), IntegerType())
wordDataFrame = tokenizer.transform(sentenceDataFrame)

# Build bigrams (n=2) from the token lists
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
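The countTokens UDF above is not exercised in the snippet; here is a minimal sketch of how it could be applied (the output column name "tokens" is an assumption):

# Hypothetical usage of the countTokens UDF: add a per-row word count
wordDataFrame.withColumn("tokens", countTokens(col("words"))).show(truncate=False)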
from pyspark.ml.feature import NGram
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark "),
(1, "I wish, wish Java, Java could"),
(2, "Logistic regression, regression models")
], ["id", "sentence"])
# your solution goes here
# ...
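One possible solution sketch (an assumption about the intended answer, not the book's official solution): tokenize the sentences, then feed the word lists into the NGram transformer.

# Sketch of one possible solution (not the official answer):
# 1. tokenize the raw sentences into word lists
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordDataFrame = tokenizer.transform(sentenceDataFrame)
# 2. turn the word lists into bigrams and inspect the result
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(wordDataFrame).select("ngrams").show(truncate=False)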
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation.
Take the DataFrame and reduce the dimensions of the vectors using PCA.
Here is an example:
# Fit a PCA model that projects the feature vectors onto the top 3 principal components
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
# your solution goes here
# ...
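A minimal sketch of one way to complete the exercise (the choice of k=2 and the inspection of explained variance are assumptions, not the book's official solution):

# Sketch of one possible solution (not the official answer):
# project the 5-dimensional vectors down to 2 principal components
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
# fraction of variance captured by each retained component
print(model.explainedVariance)
model.transform(df).select("pcaFeatures").show(truncate=False)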