This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
This lab will build on the techniques covered in the Spark tutorial to develop a simple word count application. The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this lab, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.
This could also be scaled to find the most common words in Wikipedia.
** During this lab we will cover: **
Note that for reference, you can look up the details of the relevant methods in:
labVersion = 'cs190.1x-lab2-1.0.4'
In this part of the lab, we will explore creating a base RDD with parallelize and using pair RDDs to count words.
** (1a) Create a base RDD **
We'll start by generating a base RDD by using a Python list and the sc.parallelize method. Then we'll print out the type of the base RDD.
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print(type(wordsRDD))
** (1b) Pluralize and test **
Let's use a map() transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word. Please replace <FILL IN> with your solution. If you have trouble, the next cell has the solution. After you have defined makePlural, you can run the third cell, which contains a test. If your implementation is correct, it will print 1 test passed.
This is the general form that exercises will take, except that no example solution will be provided. Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more <FILL IN> sections. The cell that needs to be modified will have # TODO: Replace <FILL IN> with appropriate code on its first line. Once the <FILL IN> sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution. The last code cell before the next markdown section will contain the tests.
# One way of completing the function
def makePlural(word):
    return word + 's'

print(makePlural('cat'))
WARNING: If test_helper, required in the cell below, is not installed, follow the instructions here.
# Load in the testing code and check to see if your answer is correct
# If incorrect it will report back '1 test failed' for each failed test
# Make sure to rerun any cell you change before trying the test again
from test_helper import Test
# TEST Pluralize and test (1b)
Test.assertEquals(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')
# PRIVATE_TEST Pluralize and test (1b)
Test.assertEquals(makePlural('cat'), 'cats', 'incorrect result: makePlural does not add an s')
** (1c) Apply makePlural to the base RDD **
Now apply makePlural to each element of the base RDD with a map() transformation, then call collect() to see the transformed RDD.
# ANSWER
pluralRDD = wordsRDD.map(makePlural)
print(pluralRDD.collect())
# TEST Apply makePlural to the base RDD(1c)
Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
'incorrect values for pluralRDD')
# PRIVATE_TEST Apply makePlural to the base RDD(1c)
Test.assertEquals(pluralRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
'incorrect values for pluralRDD')
** (1d) Pass a lambda function to map **
Let's create the same RDD using a lambda function.
# ANSWER
pluralLambdaRDD = wordsRDD.map(lambda word: word + 's')
print(pluralLambdaRDD.collect())
# TEST Pass a lambda function to map (1d)
Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
'incorrect values for pluralLambdaRDD (1d)')
# PRIVATE_TEST Pass a lambda function to map (1d)
Test.assertEquals(pluralLambdaRDD.collect(), ['cats', 'elephants', 'rats', 'rats', 'cats'],
'incorrect values for pluralLambdaRDD (1d)')
** (1e) Length of each word **
Now use map() and a lambda function to return the number of characters in each word. We'll collect this result directly into a variable.
# ANSWER
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print(pluralLengths)
# TEST Length of each word (1e)
Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],
'incorrect values for pluralLengths')
# PRIVATE_TEST Length of each word (1e)
Test.assertEquals(pluralLengths, [4, 9, 4, 4, 4],
'incorrect values for pluralLengths')
** (1f) Pair RDDs **
The next step in writing our word counting program is to create a new type of RDD, called a pair RDD. A pair RDD is an RDD where each element is a pair tuple (k, v) where k is the key and v is the value. In this example, we will create a pair consisting of ('<word>', 1) for each word element in the RDD.
We can create the pair RDD using the map() transformation with a lambda function to create a new RDD.
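As a rough plain-Python picture of the pairing step (no Spark involved; the list `words` and the name `word_pairs` are stand-ins chosen for this sketch), each word is simply wrapped in a (word, 1) tuple:

```python
# Plain-Python analogy of mapping each word to a (word, 1) pair.
# `words` stands in for the contents of the base RDD.
words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Pair every word with the count 1, giving (key, value) tuples.
word_pairs = [(word, 1) for word in words]

print(word_pairs)
```

The key is the word and the value is always 1, which is what later lets the values for a key be summed into a count.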
# ANSWER
wordPairs = wordsRDD.map(lambda s: (s, 1))
print(wordPairs.collect())
# TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],
'incorrect value for wordPairs')
# PRIVATE_TEST Pair RDDs (1f)
Test.assertEquals(wordPairs.collect(),
[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)],
'incorrect value for wordPairs')
Now, let's count the number of times a particular word appears in the RDD. There are multiple ways to perform the counting, but some are much less efficient than others.
A naive approach would be to collect() all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations.
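For contrast, the naive approach amounts to something like the following plain-Python sketch, where `collected` is a hypothetical stand-in for the list that collect() would bring back to the driver:

```python
from collections import Counter

# Stand-in for the list that collect() would return to the driver program.
collected = ['cat', 'elephant', 'rat', 'rat', 'cat']

# All counting happens in a single process, so the whole dataset
# has to fit in that process's memory.
naive_counts = Counter(collected)

print(sorted(naive_counts.items()))
```

This is fine for five words, but nothing in it is distributed, which is why the lab moves on to data parallel operations.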
** (2a) groupByKey() approach **
An approach you might first consider (we'll see shortly that there are better ways) is based on using the groupByKey() transformation. As the name implies, the groupByKey() transformation groups all the elements of the RDD with the same key into a single list in one of the partitions.
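There are two problems with using groupByKey():
- The operation requires a lot of data movement, because every value has to be moved to the partition that holds its key.
- The lists can be very large. For a frequent word, the list of values could exhaust the available memory in a worker.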
Use groupByKey() to generate a pair RDD of type ('word', iterator).
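To picture what grouping by key does, here is a minimal plain-Python sketch. The two-partition layout in `partitions` is an assumption made up for illustration; Spark decides the real layout.

```python
from collections import defaultdict

# Hypothetical layout of the (word, 1) pairs across two partitions.
partitions = [
    [('cat', 1), ('elephant', 1), ('rat', 1)],
    [('rat', 1), ('cat', 1)],
]

# Grouping gathers every value for a key into one list, unreduced.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

print(sorted(grouped.items()))
```

Note that every individual 1 is carried along into the list; nothing is summed until a later step.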
# ANSWER
# Note that groupByKey requires no parameters
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print('{0}: {1}'.format(key, list(value)))
# TEST groupByKey() approach (2a)
Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),
[('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],
'incorrect value for wordsGrouped')
# PRIVATE_TEST groupByKey() approach (2a)
Test.assertEquals(sorted(wordsGrouped.mapValues(lambda x: list(x)).collect()),
[('cat', [1, 1]), ('elephant', [1]), ('rat', [1, 1])],
'incorrect value for wordsGrouped')
** (2b) Use groupByKey() to obtain the counts **
Using the groupByKey() transformation creates an RDD containing 3 elements, each of which is a pair of a word and a Python iterator.
Now sum the iterator using a map() transformation. The result should be a pair RDD consisting of (word, count) pairs.
# ANSWER
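# Index into the pair instead of unpacking it in the lambda signature,
# which is only valid syntax in Python 2
wordCountsGrouped = wordsGrouped.map(lambda kv: (kv[0], sum(kv[1])))
print(wordCountsGrouped.collect())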
# TEST Use groupByKey() to obtain the counts (2b)
Test.assertEquals(sorted(wordCountsGrouped.collect()),
[('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCountsGrouped')
# PRIVATE_TEST Use groupByKey() to obtain the counts (2b)
Test.assertEquals(sorted(wordCountsGrouped.collect()),
[('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCountsGrouped')
** (2c) Counting using reduceByKey **
A better approach is to start from the pair RDD and then use the reduceByKey() transformation to create a new pair RDD. The reduceByKey() transformation gathers together pairs that have the same key and applies the function provided to two values at a time, iteratively reducing all of the values to a single value. reduceByKey() operates by applying the function first within each partition on a per-key basis and then across the partitions, allowing it to scale efficiently to large datasets.
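The two-phase behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation; the helper `reduce_partition` and the partition layout are hypothetical.

```python
from operator import add


def reduce_partition(pairs, func):
    """Combine the values that share a key, two values at a time."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


# Hypothetical layout of the (word, 1) pairs across two partitions.
partitions = [
    [('cat', 1), ('elephant', 1)],
    [('rat', 1), ('rat', 1), ('cat', 1)],
]

# Phase 1: reduce inside each partition, so each key has at most
# one value per partition.
partials = [reduce_partition(p, add) for p in partitions]

# Phase 2: merge the per-partition results with the same function.
merged_pairs = [item for partial in partials for item in partial.items()]
word_counts = reduce_partition(merged_pairs, add)

print(sorted(word_counts.items()))
```

Because each partition is reduced first, only one partial count per key per partition needs to be combined in the second phase, rather than every individual 1.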
# ANSWER
wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
print(wordCounts.collect())
# TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCounts')
# PRIVATE_TEST Counting using reduceByKey (2c)
Test.assertEquals(sorted(wordCounts.collect()), [('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCounts')
** (2d) All together **
The expert version of the code performs the map() to a pair RDD, the reduceByKey() transformation, and the collect() action in one statement.
# ANSWER
wordCountsCollected = (wordsRDD
                       .map(lambda s: (s, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .collect())
print(wordCountsCollected)
# TEST All together (2d)
Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCountsCollected')
# PRIVATE_TEST All together (2d)
Test.assertEquals(sorted(wordCountsCollected), [('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect value for wordCountsCollected')
** (3a) Unique words **
Calculate the number of unique words in wordsRDD. You can use other RDDs that you have already created to make this easier.
# ANSWER
uniqueWords = wordCounts.count()
print(uniqueWords)
# TEST Unique words (3a)
Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')
# PRIVATE_TEST Unique words (3a)
Test.assertEquals(uniqueWords, 3, 'incorrect count of uniqueWords')
** (3b) Mean using reduce **
Find the mean number of words per unique word in wordCounts.
Use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words. First map() the pair RDD wordCounts, which consists of (key, value) pairs, to an RDD of values.
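The arithmetic can be checked in plain Python with functools.reduce. This is a sketch that uses a local list, `word_counts`, as a stand-in for the pair RDD:

```python
from functools import reduce
from operator import add

# Stand-in for the (word, count) pairs in the pair RDD.
word_counts = [('cat', 2), ('elephant', 1), ('rat', 2)]

# Keep only the values, then fold them into a single total.
values = [count for _, count in word_counts]
total = reduce(add, values)

# Divide by the number of unique words; float() avoids integer division.
average = total / float(len(word_counts))

print(round(average, 2))
```

With five words spread over three unique words, the mean is 5 / 3, which rounds to 1.67.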
# ANSWER
from operator import add
totalCount = (wordCounts
              .map(lambda kv: kv[1])  # keep only the count from each (word, count) pair
              .reduce(add))
average = totalCount / float(wordCounts.count())
print(totalCount)
print(round(average, 2))
# TEST Mean using reduce (3b)
Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')
# PRIVATE_TEST Mean using reduce (3b)
Test.assertEquals(round(average, 2), 1.67, 'incorrect value of average')
In this section we will finish developing our word count application. We'll have to build the wordCount function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data.
** (4a) wordCount function **
First, define a function for word counting. You should reuse the techniques that have been covered in earlier parts of this lab. This function should take in an RDD that is a list of words like wordsRDD and return a pair RDD that has all of the words and their associated counts.
# ANSWER
def wordCount(wordListRDD):
    """Creates a pair RDD with word counts from an RDD of words.

    Args:
        wordListRDD (RDD of str): An RDD consisting of words.

    Returns:
        RDD of (str, int): An RDD consisting of (word, count) tuples.
    """
    # add was imported from operator in part (3b)
    return (wordListRDD
            .map(lambda s: (s, 1))
            .reduceByKey(add))

print(wordCount(wordsRDD).collect())
# TEST wordCount function (4a)
Test.assertEquals(sorted(wordCount(wordsRDD).collect()),
[('cat', 2), ('elephant', 1), ('rat', 2)],
'incorrect definition for wordCount function')
# PRIVATE_TEST wordCount function (4a)
privateWordsRDD = sc.parallelize(['cat', 'cat', 'cat', 'rat', 'elephant', 'elephant'])
Test.assertEquals(sorted(wordCount(privateWordsRDD).collect()),
[('cat', 3), ('elephant', 2), ('rat', 1)],
'incorrect definition for wordCount function')
** (4b) Capitalization and punctuation **
Real world files are more complicated than the data we have been using in this lab. Some of the issues we have to address are inconsistent capitalization, punctuation, and leading or trailing spaces.
Define the function removePunctuation that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. Use the Python re module to remove any text that is not a letter, number, or space. Reading help(re.sub) might be useful.
If you are unfamiliar with regular expressions, you may want to review this tutorial from Google. Also, this website is a great resource for debugging your regular expression.
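The order of the steps matters when punctuation sits next to the leading or trailing spaces. The sketch below uses only the standard re module; the names `strip_first` and `strip_last` are made up for illustration.

```python
import re

text = ' * Remove punctuation then spaces * '

# Strip first, then remove punctuation: the spaces that sat next to
# the asterisks are left behind at both ends.
strip_first = re.sub(r'[^a-z0-9 ]', '', text.strip().lower())

# Remove punctuation first, then strip: no stray spaces remain.
strip_last = re.sub(r'[^a-z0-9 ]', '', text.lower()).strip()

print(repr(strip_first))
print(repr(strip_last))
```

Stripping after the punctuation has been removed is the order that leaves no stray spaces at either end.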
# ANSWER
import re
def removePunctuation(text):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        text (str): A string.

    Returns:
        str: The cleaned up string.
    """
    return re.sub(r'[^A-Za-z0-9 ]', '', text).lower().strip()

print(removePunctuation('Hi, you!'))
print(removePunctuation(' No under_score!'))
print(removePunctuation(' * Remove punctuation then spaces * '))
# TEST Capitalization and punctuation (4b)
Test.assertEquals(removePunctuation(" The Elephant's 4 cats. "),
'the elephants 4 cats',
'incorrect definition for removePunctuation function')
# PRIVATE_TEST Capitalization and punctuation (4b)
Test.assertEquals(removePunctuation(" Hi, It's possible I'm cheating. "),
'hi its possible im cheating',
'incorrect definition for removePunctuation function')
** (4c) Load a text file **
For the next part of this lab, we will use the Complete Works of William Shakespeare from Project Gutenberg. To convert a text file into an RDD, we use the SparkContext.textFile() method. We also apply the recently defined removePunctuation() function using a map() transformation to strip out the punctuation and change all text to lower case. Since the file is large we use take(15), so that we only print 15 lines.
# Just run this code
import os.path
baseDir = os.path.join('databricks-datasets')
inputPath = os.path.join('cs100', 'lab1', 'data-001', 'shakespeare.txt')
fileName = os.path.join(baseDir, inputPath)
shakespeareRDD = (sc
                  .textFile(fileName, 8)
                  .map(removePunctuation))
print('\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(15)))
** (4d) Words from lines **
Before we can use the wordCount() function, we have to address two issues with the format of the RDD: each line needs to be split into words by its spaces, and empty elements need to be filtered out.
Apply a transformation that will split each element of the RDD by its spaces. For each element of the RDD, you should apply Python's string split() function. You might think that a map() transformation is the way to do this, but think about what the result of the split() function will be.
Note:
- Do not use the default implementation of split(), but pass in a separator value. For example, to split line by commas you would use line.split(',').
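The difference between a map-style and a flatMap-style result, and the effect of passing an explicit separator, can be seen in plain Python. The sample `lines` are invented for this sketch.

```python
from itertools import chain

# Invented sample lines, including a double space and an empty line.
lines = ['to be  or not', '', 'to be']

# map-style: one list per line, so the result is a list of lists.
mapped = [line.split(' ') for line in lines]

# flatMap-style: the per-line lists are concatenated into one flat list.
flat = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)
print(flat)
```

With an explicit ' ' separator, a double space and an empty line each produce an empty string, which is why empty elements still have to be dealt with afterwards. Calling split() with no argument would instead drop them.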
# ANSWER
shakespeareWordsRDD = shakespeareRDD.flatMap(lambda x: x.split(' '))
shakespeareWordCount = shakespeareWordsRDD.count()
print(shakespeareWordsRDD.top(5))
print(shakespeareWordCount)
# TEST Words from lines (4d)
# This test allows for leading spaces to be removed either before or after
# punctuation is removed.
Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,
'incorrect value for shakespeareWordCount')
Test.assertEquals(shakespeareWordsRDD.top(5),
[u'zwaggerd', u'zounds', u'zounds', u'zounds', u'zounds'],
'incorrect value for shakespeareWordsRDD')
# PRIVATE_TEST Words from lines (4d)
# This test allows for leading spaces to be removed either before or after
# punctuation is removed.
Test.assertTrue(shakespeareWordCount == 927631 or shakespeareWordCount == 928908,
'incorrect value for shakespeareWordCount')
Test.assertEquals(shakespeareWordsRDD.map(lambda x: len(x)).sum(), 3697209,
'incorrect value for shakespeareWordsRDD')
** (4e) Remove empty elements **
The next step is to filter out the empty elements. Remove all entries where the word is ''.
# ANSWER
shakeWordsRDD = shakespeareWordsRDD.filter(lambda x: x != '')
shakeWordCount = shakeWordsRDD.count()
print(shakeWordCount)
# TEST Remove empty elements (4e)
Test.assertEquals(shakeWordCount, 882996, 'incorrect value for shakeWordCount')
# PRIVATE_TEST Remove empty elements (4e)
Test.assertEquals(shakeWordsRDD
.filter(lambda x: x == '')
.count(), 0, 'incorrect value for shakeWordsRDD')
** (4f) Count the words **
We now have an RDD that is only words. Next, let's apply the wordCount() function to produce a list of word counts. We can view the top 15 words by using the takeOrdered() action; however, since the elements of the RDD are pairs, we need a custom sort function that sorts using the value part of the pair.
You'll notice that many of the words are common English words. These are called stopwords. In a later lab, we will see how to eliminate them from the results.
Use the wordCount() function and takeOrdered() to obtain the fifteen most common words and their counts.
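Ordering by the value part of each pair can be tried out in plain Python with heapq.nsmallest. This is only an analogy for an ascending "take the first n" operation with a key function, and the sample counts are invented.

```python
import heapq

# Invented (word, count) pairs for illustration.
word_counts = [('the', 5), ('zounds', 1), ('and', 4), ('i', 3), ('to', 2)]

# An ascending selection returns the smallest keys first, so negating
# the count makes the most frequent words come out on top.
top3 = heapq.nsmallest(3, word_counts, key=lambda pair: -pair[1])

print(top3)
```

The key function looks only at the count (index 1 of each pair), so the words themselves play no part in the ordering.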
# ANSWER
top15WordsAndCounts = wordCount(shakeWordsRDD).takeOrdered(15, key=lambda x: -x[1])
print('\n'.join('{0}: {1}'.format(w, c) for w, c in top15WordsAndCounts))
# TEST Count the words (4f)
Test.assertEquals(top15WordsAndCounts,
[(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),
(u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),
(u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],
'incorrect value for top15WordsAndCounts')
# PRIVATE_TEST Count the words (4f)
Test.assertEquals(top15WordsAndCounts,
[(u'the', 27361), (u'and', 26028), (u'i', 20681), (u'to', 19150), (u'of', 17463),
(u'a', 14593), (u'you', 13615), (u'my', 12481), (u'in', 10956), (u'that', 10890),
(u'is', 9134), (u'not', 8497), (u'with', 7771), (u'me', 7769), (u'it', 7678)],
'incorrect value for top15WordsAndCounts')