In this exercise, you will use Spark MLlib feature transformers: Tokenizer, NGram, and PCA.
This exercise is part of the Scaling Machine Learning with Spark book, available on the O'Reilly platform and on Amazon.
# Create SparkSession from builder
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('Scalling_ml_with_spark-week_2') \
    .getOrCreate()
Extract n-grams using the Tokenizer, connecting data marshaling with feature engineering. For that, use the NGram functionality.
Check out this example:
# Split each sentence into a list of words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
# UDF that counts the tokens in a row (defined here, applied separately below)
countTokens = udf(lambda words: len(words), IntegerType())
wordDataFrame = tokenizer.transform(sentenceDataFrame)

# Build bigrams (n=2) from the token lists
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
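The countTokens UDF above is not exercised in the snippet; here is a minimal sketch of how it could be applied (the output column name "tokens" is an assumption):

# Hypothetical usage of the countTokens UDF: add a per-row word count
wordDataFrame.withColumn("tokens", countTokens(col("words"))).show(truncate=False)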
from pyspark.ml.feature import NGram
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark "),
(1, "I wish, wish Java, Java could"),
(2, "Logistic regression, regression models")
], ["id", "sentence"])
# your solution goes here
# ...
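One possible solution sketch (an assumption about the intended answer, not the book's official solution): tokenize the sentences, then feed the word lists into the NGram transformer.

# Sketch of one possible solution (not the official answer):
# 1. tokenize the raw sentences into word lists
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordDataFrame = tokenizer.transform(sentenceDataFrame)
# 2. turn the word lists into bigrams and inspect the result
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(wordDataFrame).select("ngrams").show(truncate=False)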
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation.
Take the DataFrame and reduce the dimensions of the vectors using PCA.
Here is an example:
# Fit a PCA model that projects the feature vectors onto the top 3 principal components
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
# your solution goes here
# ...
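A minimal sketch of one way to complete the exercise (the choice of k=2 and the inspection of explained variance are assumptions, not the book's official solution):

# Sketch of one possible solution (not the official answer):
# project the 5-dimensional vectors down to 2 principal components
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
# fraction of variance captured by each retained component
print(model.explainedVariance)
model.transform(df).select("pcaFeatures").show(truncate=False)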