This notebook walks through the traditional word count example and, along the way, covers map, flatMap, filter, counting with reduceByKey, sorting with sortByKey, and an enhanced word count.
@author: Anindya Saha @email: mail.anindya@gmail.com
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('wordcount-pyspark').getOrCreate()
spark
SparkSession - in-memory
infile = 'data/wordcount.txt'
Read the file:
# read the file
word_rdd = spark.sparkContext.textFile(infile).cache()
1. Read the file, print every line:
data = word_rdd
# print each line of the file
for line in data.collect():
    print(line)
Management (or managing) is the administration of an organization, whether it be a business, a not-for-profit organization, or government body. Management includes the activities of setting the strategy of an organization and coordinating the efforts of its employees or volunteers to accomplish its objectives through the application of available resources, such as financial, natural, technological, and human resources. The term "management" may also refer to the people who manage an organization. Management is also an academic discipline, a social science whose objective is to study social organization and organizational leadership. Management is studied at colleges and universities; some important degrees in management are the Bachelor of Commerce (B.Com.) and Master of Business Administration (M.B.A.) and, for the public sector, the Master of Public Administration (MPA) degree. Individuals who aim at becoming management researchers or professors may complete the Doctor of Business Administration (DBA) or the PhD in business administration or management. In larger organizations, there are generally three levels of managers, which are typically organized in a hierarchical, pyramid structure. Senior managers, such as the Board of Directors, Chief Executive Officer (CEO) or President of an organization, set the strategic goals of the organization and make decisions on how the overall organization will operate. Senior managers provide direction to the middle managers who report to them. Middle managers, examples of which would include branch managers, regional managers and section managers, provide direction to front-line managers. Middle managers communicate the strategic goals of senior management to the front-line managers. Lower managers, such as supervisors and front-line team leaders, oversee the work of regular employees (or volunteers, in some voluntary organizations) and provide direction on their work. 
In smaller organizations, the roles of managers have much wider scopes. A manager can perform several roles or even all of the roles commonly observed in a large organization. There are many more smaller organizations than larger ones.
2. Read the file, print every word:
data = word_rdd.flatMap(lambda line: line.split(" "))
# print each and every word
print(data.collect())
['Management', '(or', 'managing)', 'is', 'the', 'administration', 'of', 'an', 'organization,', 'whether', 'it', 'be', 'a', 'business,', 'a', 'not-for-profit', 'organization,', 'or', 'government', 'body.', 'Management', 'includes', 'the', 'activities', 'of', 'setting', 'the', 'strategy', 'of', 'an', 'organization', 'and', 'coordinating', 'the', 'efforts', 'of', 'its', 'employees', 'or', 'volunteers', 'to', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'of', 'available', 'resources,', 'such', 'as', 'financial,', 'natural,', 'technological,', 'and', 'human', 'resources.', 'The', 'term', '"management"', 'may', 'also', 'refer', 'to', 'the', 'people', 'who', 'manage', 'an', 'organization.', 'Management', 'is', 'also', 'an', 'academic', 'discipline,', 'a', 'social', 'science', 'whose', 'objective', 'is', 'to', 'study', 'social', 'organization', 'and', 'organizational', 'leadership.', 'Management', 'is', 'studied', 'at', 'colleges', 'and', 'universities;', 'some', 'important', 'degrees', 'in', 'management', 'are', 'the', 'Bachelor', 'of', 'Commerce', '(B.Com.)', 'and', 'Master', 'of', 'Business', 'Administration', '(M.B.A.)', 'and,', 'for', 'the', 'public', 'sector,', 'the', 'Master', 'of', 'Public', 'Administration', '(MPA)', 'degree.', 'Individuals', 'who', 'aim', 'at', 'becoming', 'management', 'researchers', 'or', 'professors', 'may', 'complete', 'the', 'Doctor', 'of', 'Business', 'Administration', '(DBA)', 'or', 'the', 'PhD', 'in', 'business', 'administration', 'or', 'management.', 'In', 'larger', 'organizations,', 'there', 'are', 'generally', 'three', 'levels', 'of', 'managers,', 'which', 'are', 'typically', 'organized', 'in', 'a', 'hierarchical,', 'pyramid', 'structure.', 'Senior', 'managers,', 'such', 'as', 'the', 'Board', 'of', 'Directors,', 'Chief', 'Executive', 'Officer', '(CEO)', 'or', 'President', 'of', 'an', 'organization,', 'set', 'the', 'strategic', 'goals', 'of', 'the', 'organization', 'and', 'make', 'decisions', 'on', 'how', 'the', 
'overall', 'organization', 'will', 'operate.', 'Senior', 'managers', 'provide', 'direction', 'to', 'the', 'middle', 'managers', 'who', 'report', 'to', 'them.', 'Middle', 'managers,', 'examples', 'of', 'which', 'would', 'include', 'branch', 'managers,', 'regional', 'managers', 'and', 'section', 'managers,', 'provide', 'direction', 'to', 'front-line', 'managers.', 'Middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'of', 'senior', 'management', 'to', 'the', 'front-line', 'managers.', 'Lower', 'managers,', 'such', 'as', 'supervisors', 'and', 'front-line', 'team', 'leaders,', 'oversee', 'the', 'work', 'of', 'regular', 'employees', '(or', 'volunteers,', 'in', 'some', 'voluntary', 'organizations)', 'and', 'provide', 'direction', 'on', 'their', 'work.', 'In', 'smaller', 'organizations,', 'the', 'roles', 'of', 'managers', 'have', 'much', 'wider', 'scopes.', 'A', 'manager', 'can', 'perform', 'several', 'roles', 'or', 'even', 'all', 'of', 'the', 'roles', 'commonly', 'observed', 'in', 'a', 'large', 'organization.', 'There', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones.']
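To see why step 2 uses flatMap rather than map, here is a minimal sketch in plain Python (no Spark needed). The two sample lines are hypothetical, and the list operations only mimic what the RDD transformations would return after collect().

```python
from itertools import chain

# Hypothetical two-line sample standing in for the lines of the RDD.
lines = ["Management is studied", "at colleges"]

# map-like behaviour: one output element per input line (a list of lists).
mapped = [line.split(" ") for line in lines]

# flatMap-like behaviour: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['Management', 'is', 'studied'], ['at', 'colleges']]
print(flat_mapped)  # ['Management', 'is', 'studied', 'at', 'colleges']
```

With map, every later step would have to deal with lists of words; with flatMap, each word is its own element, which is what the counting steps rely on.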
3. Read the file, generate words, filter out empty words and print each word:
data = (word_rdd
.flatMap(lambda line: line.split(" "))
.filter(lambda word: len(word) > 0)
)
# print each and every word
print(data.collect())
['Management', '(or', 'managing)', 'is', 'the', 'administration', 'of', 'an', 'organization,', 'whether', 'it', 'be', 'a', 'business,', 'a', 'not-for-profit', 'organization,', 'or', 'government', 'body.', 'Management', 'includes', 'the', 'activities', 'of', 'setting', 'the', 'strategy', 'of', 'an', 'organization', 'and', 'coordinating', 'the', 'efforts', 'of', 'its', 'employees', 'or', 'volunteers', 'to', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'of', 'available', 'resources,', 'such', 'as', 'financial,', 'natural,', 'technological,', 'and', 'human', 'resources.', 'The', 'term', '"management"', 'may', 'also', 'refer', 'to', 'the', 'people', 'who', 'manage', 'an', 'organization.', 'Management', 'is', 'also', 'an', 'academic', 'discipline,', 'a', 'social', 'science', 'whose', 'objective', 'is', 'to', 'study', 'social', 'organization', 'and', 'organizational', 'leadership.', 'Management', 'is', 'studied', 'at', 'colleges', 'and', 'universities;', 'some', 'important', 'degrees', 'in', 'management', 'are', 'the', 'Bachelor', 'of', 'Commerce', '(B.Com.)', 'and', 'Master', 'of', 'Business', 'Administration', '(M.B.A.)', 'and,', 'for', 'the', 'public', 'sector,', 'the', 'Master', 'of', 'Public', 'Administration', '(MPA)', 'degree.', 'Individuals', 'who', 'aim', 'at', 'becoming', 'management', 'researchers', 'or', 'professors', 'may', 'complete', 'the', 'Doctor', 'of', 'Business', 'Administration', '(DBA)', 'or', 'the', 'PhD', 'in', 'business', 'administration', 'or', 'management.', 'In', 'larger', 'organizations,', 'there', 'are', 'generally', 'three', 'levels', 'of', 'managers,', 'which', 'are', 'typically', 'organized', 'in', 'a', 'hierarchical,', 'pyramid', 'structure.', 'Senior', 'managers,', 'such', 'as', 'the', 'Board', 'of', 'Directors,', 'Chief', 'Executive', 'Officer', '(CEO)', 'or', 'President', 'of', 'an', 'organization,', 'set', 'the', 'strategic', 'goals', 'of', 'the', 'organization', 'and', 'make', 'decisions', 'on', 'how', 'the', 
'overall', 'organization', 'will', 'operate.', 'Senior', 'managers', 'provide', 'direction', 'to', 'the', 'middle', 'managers', 'who', 'report', 'to', 'them.', 'Middle', 'managers,', 'examples', 'of', 'which', 'would', 'include', 'branch', 'managers,', 'regional', 'managers', 'and', 'section', 'managers,', 'provide', 'direction', 'to', 'front-line', 'managers.', 'Middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'of', 'senior', 'management', 'to', 'the', 'front-line', 'managers.', 'Lower', 'managers,', 'such', 'as', 'supervisors', 'and', 'front-line', 'team', 'leaders,', 'oversee', 'the', 'work', 'of', 'regular', 'employees', '(or', 'volunteers,', 'in', 'some', 'voluntary', 'organizations)', 'and', 'provide', 'direction', 'on', 'their', 'work.', 'In', 'smaller', 'organizations,', 'the', 'roles', 'of', 'managers', 'have', 'much', 'wider', 'scopes.', 'A', 'manager', 'can', 'perform', 'several', 'roles', 'or', 'even', 'all', 'of', 'the', 'roles', 'commonly', 'observed', 'in', 'a', 'large', 'organization.', 'There', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones.']
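The outputs of steps 2 and 3 look identical for this file, so it may help to see when the filter actually matters. The sketch below uses a hypothetical line with a double space and a trailing space; it runs in plain Python and only mirrors the flatMap and filter logic.

```python
# Hypothetical sample line with a double space and a trailing space.
line = "Senior  managers provide direction "

# Splitting on a single space yields an empty string for each extra space.
tokens = line.split(" ")
print(tokens)     # ['Senior', '', 'managers', 'provide', 'direction', '']

# Same predicate as the filter in step 3.
non_empty = [word for word in tokens if len(word) > 0]
print(non_empty)  # ['Senior', 'managers', 'provide', 'direction']
```

Calling split() with no argument splits on runs of whitespace and would avoid the empty strings in the first place; the notebook keeps split(" ") so that the filter step has something to demonstrate.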
4. Read the file, generate words, trim each word, convert to lowercase, remove special characters, filter out words shorter than three characters and print each word:
import re
data = (word_rdd
.flatMap(lambda line: line.split(" "))
.map(lambda word: word.strip().lower())
.map(lambda word: re.sub("[,.:;'\"\\?\\-!\\(\\)]", "", word))
.filter(lambda word: len(word) > 2)
)
# print each and every word
print(data.collect())
['management', 'managing', 'the', 'administration', 'organization', 'whether', 'business', 'notforprofit', 'organization', 'government', 'body', 'management', 'includes', 'the', 'activities', 'setting', 'the', 'strategy', 'organization', 'and', 'coordinating', 'the', 'efforts', 'its', 'employees', 'volunteers', 'accomplish', 'its', 'objectives', 'through', 'the', 'application', 'available', 'resources', 'such', 'financial', 'natural', 'technological', 'and', 'human', 'resources', 'the', 'term', 'management', 'may', 'also', 'refer', 'the', 'people', 'who', 'manage', 'organization', 'management', 'also', 'academic', 'discipline', 'social', 'science', 'whose', 'objective', 'study', 'social', 'organization', 'and', 'organizational', 'leadership', 'management', 'studied', 'colleges', 'and', 'universities', 'some', 'important', 'degrees', 'management', 'are', 'the', 'bachelor', 'commerce', 'bcom', 'and', 'master', 'business', 'administration', 'mba', 'and', 'for', 'the', 'public', 'sector', 'the', 'master', 'public', 'administration', 'mpa', 'degree', 'individuals', 'who', 'aim', 'becoming', 'management', 'researchers', 'professors', 'may', 'complete', 'the', 'doctor', 'business', 'administration', 'dba', 'the', 'phd', 'business', 'administration', 'management', 'larger', 'organizations', 'there', 'are', 'generally', 'three', 'levels', 'managers', 'which', 'are', 'typically', 'organized', 'hierarchical', 'pyramid', 'structure', 'senior', 'managers', 'such', 'the', 'board', 'directors', 'chief', 'executive', 'officer', 'ceo', 'president', 'organization', 'set', 'the', 'strategic', 'goals', 'the', 'organization', 'and', 'make', 'decisions', 'how', 'the', 'overall', 'organization', 'will', 'operate', 'senior', 'managers', 'provide', 'direction', 'the', 'middle', 'managers', 'who', 'report', 'them', 'middle', 'managers', 'examples', 'which', 'would', 'include', 'branch', 'managers', 'regional', 'managers', 'and', 'section', 'managers', 'provide', 'direction', 'frontline', 
'managers', 'middle', 'managers', 'communicate', 'the', 'strategic', 'goals', 'senior', 'management', 'the', 'frontline', 'managers', 'lower', 'managers', 'such', 'supervisors', 'and', 'frontline', 'team', 'leaders', 'oversee', 'the', 'work', 'regular', 'employees', 'volunteers', 'some', 'voluntary', 'organizations', 'and', 'provide', 'direction', 'their', 'work', 'smaller', 'organizations', 'the', 'roles', 'managers', 'have', 'much', 'wider', 'scopes', 'manager', 'can', 'perform', 'several', 'roles', 'even', 'all', 'the', 'roles', 'commonly', 'observed', 'large', 'organization', 'there', 'are', 'many', 'more', 'smaller', 'organizations', 'than', 'larger', 'ones']
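The cleaning chain in step 4 can be tried on individual tokens without Spark. This is a small sketch: the helper name `normalize` and the sample tokens are chosen for illustration, and the pattern is the same character class as in the notebook, written as a raw string.

```python
import re

# Same character class as the notebook's pattern, written as a raw string.
SPECIAL_CHARS = re.compile(r"[,.:;'\"?\-!()]")

def normalize(word):
    """Trim, lowercase, then delete the listed punctuation characters."""
    return SPECIAL_CHARS.sub("", word.strip().lower())

# Sample tokens taken from the output of step 2.
samples = ["(B.Com.)", "not-for-profit", '"management"', "of", "organization,"]

# Same predicate as the filter in step 4: keep words longer than 2 characters.
cleaned = [w for w in map(normalize, samples) if len(w) > 2]
print(cleaned)  # ['bcom', 'notforprofit', 'management', 'organization']
```

Note that removing the hyphen merges compound words ("front-line" becomes "frontline"), and the length filter drops short words such as "of" and "or", which is why they are missing from the output above.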
5. Read the file, generate words, trim each word, convert to lowercase, remove special characters, filter out words shorter than three characters, count each word and print the word and the count:
import re
data = (word_rdd
.flatMap(lambda line: line.split(" "))
.map(lambda word: word.strip().lower())
.map(lambda word: re.sub("[,.:;'\"\\?\\-!\\(\\)]", "", word))
.filter(lambda word: len(word) > 2)
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a + b)
)
# print each word with its count
print(data.collect())
[('leadership', 1), ('human', 1), ('bcom', 1), ('provide', 3), ('even', 1), ('perform', 1), ('set', 1), ('than', 1), ('several', 1), ('levels', 1), ('science', 1), ('officer', 1), ('ceo', 1), ('typically', 1), ('large', 1), ('becoming', 1), ('commerce', 1), ('resources', 2), ('pyramid', 1), ('examples', 1), ('voluntary', 1), ('objectives', 1), ('larger', 2), ('may', 2), ('managers', 13), ('leaders', 1), ('make', 1), ('manager', 1), ('middle', 3), ('there', 2), ('strategic', 2), ('financial', 1), ('chief', 1), ('overall', 1), ('researchers', 1), ('have', 1), ('volunteers', 2), ('social', 2), ('more', 1), ('roles', 3), ('work', 2), ('efforts', 1), ('operate', 1), ('degrees', 1), ('communicate', 1), ('term', 1), ('professors', 1), ('would', 1), ('three', 1), ('include', 1), ('refer', 1), ('team', 1), ('bachelor', 1), ('master', 2), ('doctor', 1), ('whose', 1), ('individuals', 1), ('sector', 1), ('academic', 1), ('discipline', 1), ('public', 2), ('management', 9), ('organizations', 4), ('lower', 1), ('scopes', 1), ('board', 1), ('senior', 3), ('mpa', 1), ('direction', 3), ('regional', 1), ('them', 1), ('executive', 1), ('organizational', 1), ('regular', 1), ('are', 4), ('business', 4), ('coordinating', 1), ('oversee', 1), ('activities', 1), ('dba', 1), ('commonly', 1), ('important', 1), ('can', 1), ('frontline', 3), ('how', 1), ('for', 1), ('all', 1), ('smaller', 2), ('its', 2), ('wider', 1), ('natural', 1), ('the', 22), ('who', 3), ('complete', 1), ('president', 1), ('observed', 1), ('managing', 1), ('report', 1), ('available', 1), ('many', 1), ('supervisors', 1), ('government', 1), ('colleges', 1), ('will', 1), ('degree', 1), ('structure', 1), ('such', 3), ('strategy', 1), ('branch', 1), ('phd', 1), ('and', 10), ('some', 2), ('which', 2), ('hierarchical', 1), ('goals', 2), ('includes', 1), ('accomplish', 1), ('much', 1), ('organized', 1), ('administration', 5), ('aim', 1), ('also', 2), ('body', 1), ('directors', 1), ('mba', 1), ('study', 1), ('section', 1), 
('decisions', 1), ('through', 1), ('setting', 1), ('generally', 1), ('their', 1), ('objective', 1), ('universities', 1), ('whether', 1), ('employees', 2), ('technological', 1), ('organization', 9), ('ones', 1), ('people', 1), ('manage', 1), ('studied', 1), ('application', 1), ('notforprofit', 1)]
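What the (word, 1) mapping followed by reduceByKey computes can be sketched locally. The function `reduce_by_key` below is a hypothetical single-machine stand-in written for illustration, not Spark's implementation, and the word list is a made-up sample.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey (illustration only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold the values of each key with the supplied binary function.
    return {key: reduce(func, values) for key, values in grouped.items()}

# Hypothetical cleaned words, as produced by the earlier steps.
words = ["the", "managers", "the", "management", "managers", "the"]
pairs = [(word, 1) for word in words]

counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'the': 3, 'managers': 2, 'management': 1}
```

In Spark the values of a key are combined within each partition before being merged across partitions, so the function passed to reduceByKey should be associative and commutative, as addition is here. That is also why the collected output above comes back in no particular order.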
6. Read the file, generate words, trim each word, convert to lowercase, remove special characters, filter out words shorter than three characters, count each word, sort by count DESC and print the count and the word:
import re
data = (word_rdd
.flatMap(lambda line: line.split(" "))
.map(lambda word: word.strip().lower())
.map(lambda word: re.sub("[,.:;'\"\\?\\-!\\(\\)]", "", word))
.filter(lambda word: len(word) > 2)
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a + b)
.map(lambda wc: (wc[1], wc[0]))
.sortByKey(False)
)
# print each count with its word, highest count first
print(data.collect())
[(22, 'the'), (13, 'managers'), (10, 'and'), (9, 'management'), (9, 'organization'), (5, 'administration'), (4, 'organizations'), (4, 'are'), (4, 'business'), (3, 'provide'), (3, 'middle'), (3, 'roles'), (3, 'senior'), (3, 'direction'), (3, 'frontline'), (3, 'who'), (3, 'such'), (2, 'resources'), (2, 'larger'), (2, 'may'), (2, 'there'), (2, 'strategic'), (2, 'volunteers'), (2, 'social'), (2, 'work'), (2, 'master'), (2, 'public'), (2, 'smaller'), (2, 'its'), (2, 'some'), (2, 'which'), (2, 'goals'), (2, 'also'), (2, 'employees'), (1, 'leadership'), (1, 'human'), (1, 'bcom'), (1, 'even'), (1, 'perform'), (1, 'set'), (1, 'than'), (1, 'several'), (1, 'levels'), (1, 'science'), (1, 'officer'), (1, 'ceo'), (1, 'typically'), (1, 'large'), (1, 'becoming'), (1, 'commerce'), (1, 'pyramid'), (1, 'examples'), (1, 'voluntary'), (1, 'objectives'), (1, 'leaders'), (1, 'make'), (1, 'manager'), (1, 'financial'), (1, 'chief'), (1, 'overall'), (1, 'researchers'), (1, 'have'), (1, 'more'), (1, 'efforts'), (1, 'operate'), (1, 'degrees'), (1, 'communicate'), (1, 'term'), (1, 'professors'), (1, 'would'), (1, 'three'), (1, 'include'), (1, 'refer'), (1, 'team'), (1, 'bachelor'), (1, 'doctor'), (1, 'whose'), (1, 'individuals'), (1, 'sector'), (1, 'academic'), (1, 'discipline'), (1, 'lower'), (1, 'scopes'), (1, 'board'), (1, 'mpa'), (1, 'regional'), (1, 'them'), (1, 'executive'), (1, 'organizational'), (1, 'regular'), (1, 'coordinating'), (1, 'oversee'), (1, 'activities'), (1, 'dba'), (1, 'commonly'), (1, 'important'), (1, 'can'), (1, 'how'), (1, 'for'), (1, 'all'), (1, 'wider'), (1, 'natural'), (1, 'complete'), (1, 'president'), (1, 'observed'), (1, 'managing'), (1, 'report'), (1, 'available'), (1, 'many'), (1, 'supervisors'), (1, 'government'), (1, 'colleges'), (1, 'will'), (1, 'degree'), (1, 'structure'), (1, 'strategy'), (1, 'branch'), (1, 'phd'), (1, 'hierarchical'), (1, 'includes'), (1, 'accomplish'), (1, 'much'), (1, 'organized'), (1, 'aim'), (1, 'body'), (1, 'directors'), (1, 'mba'), 
(1, 'study'), (1, 'section'), (1, 'decisions'), (1, 'through'), (1, 'setting'), (1, 'generally'), (1, 'their'), (1, 'objective'), (1, 'universities'), (1, 'whether'), (1, 'technological'), (1, 'ones'), (1, 'people'), (1, 'manage'), (1, 'studied'), (1, 'application'), (1, 'notforprofit')]
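Step 6 sorts by count by first swapping each pair to (count, word), because sortByKey orders on the first element. The plain-Python sketch below mirrors that approach and shows a key-function alternative; the four pairs are taken from the output of step 5.

```python
# (word, count) pairs taken from the output of step 5.
counts = [("and", 10), ("the", 22), ("roles", 3), ("managers", 13)]

# Approach used in the notebook: swap to (count, word), then sort descending.
swapped = sorted(((count, word) for word, count in counts), reverse=True)
print(swapped)   # [(22, 'the'), (13, 'managers'), (10, 'and'), (3, 'roles')]

# Alternative: keep (word, count) and sort with a key function instead.
by_count = sorted(counts, key=lambda wc: wc[1], reverse=True)
print(by_count)  # [('the', 22), ('managers', 13), ('and', 10), ('roles', 3)]
```

On an RDD, the second form would correspond to something like sortBy(lambda wc: wc[1], ascending=False), which avoids the extra swap and keeps the pairs in (word, count) order.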
7.
a. Read the file, generate words, trim and lowercase, remove special characters, filter out words shorter than three characters, count each word, sort by count DESC
b. store the result in memory
c. print the word and the count
d. print the word only
e. print the total number of occurrences per first character, counting only words with more than 5 occurrences. The result is sorted by count ASC
# a. generate words, trim and lowercase, remove special characters, drop words shorter than three characters, count each word, sort by count DESC
import re
data = (word_rdd
.flatMap(lambda line: line.split(" "))
.map(lambda word: word.strip().lower())
.map(lambda word: re.sub("[,.:;'\"\\?\\-!\\(\\)]", "", word))
.filter(lambda word: len(word) > 2)
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a + b)
.map(lambda wc: (wc[1], wc[0]))
.sortByKey(False)
)
# b. store the result in memory
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)
PythonRDD[31] at RDD at PythonRDD.scala:48
# c. print the word and the count
print(data.map(lambda wc: (wc[1], wc[0])).collect())
[('the', 22), ('managers', 13), ('and', 10), ('management', 9), ('organization', 9), ('administration', 5), ('organizations', 4), ('are', 4), ('business', 4), ('provide', 3), ('middle', 3), ('roles', 3), ('senior', 3), ('direction', 3), ('frontline', 3), ('who', 3), ('such', 3), ('resources', 2), ('larger', 2), ('may', 2), ('there', 2), ('strategic', 2), ('volunteers', 2), ('social', 2), ('work', 2), ('master', 2), ('public', 2), ('smaller', 2), ('its', 2), ('some', 2), ('which', 2), ('goals', 2), ('also', 2), ('employees', 2), ('leadership', 1), ('human', 1), ('bcom', 1), ('even', 1), ('perform', 1), ('set', 1), ('than', 1), ('several', 1), ('levels', 1), ('science', 1), ('officer', 1), ('ceo', 1), ('typically', 1), ('large', 1), ('becoming', 1), ('commerce', 1), ('pyramid', 1), ('examples', 1), ('voluntary', 1), ('objectives', 1), ('leaders', 1), ('make', 1), ('manager', 1), ('financial', 1), ('chief', 1), ('overall', 1), ('researchers', 1), ('have', 1), ('more', 1), ('efforts', 1), ('operate', 1), ('degrees', 1), ('communicate', 1), ('term', 1), ('professors', 1), ('would', 1), ('three', 1), ('include', 1), ('refer', 1), ('team', 1), ('bachelor', 1), ('doctor', 1), ('whose', 1), ('individuals', 1), ('sector', 1), ('academic', 1), ('discipline', 1), ('lower', 1), ('scopes', 1), ('board', 1), ('mpa', 1), ('regional', 1), ('them', 1), ('executive', 1), ('organizational', 1), ('regular', 1), ('coordinating', 1), ('oversee', 1), ('activities', 1), ('dba', 1), ('commonly', 1), ('important', 1), ('can', 1), ('how', 1), ('for', 1), ('all', 1), ('wider', 1), ('natural', 1), ('complete', 1), ('president', 1), ('observed', 1), ('managing', 1), ('report', 1), ('available', 1), ('many', 1), ('supervisors', 1), ('government', 1), ('colleges', 1), ('will', 1), ('degree', 1), ('structure', 1), ('strategy', 1), ('branch', 1), ('phd', 1), ('hierarchical', 1), ('includes', 1), ('accomplish', 1), ('much', 1), ('organized', 1), ('aim', 1), ('body', 1), ('directors', 1), ('mba', 1), 
('study', 1), ('section', 1), ('decisions', 1), ('through', 1), ('setting', 1), ('generally', 1), ('their', 1), ('objective', 1), ('universities', 1), ('whether', 1), ('technological', 1), ('ones', 1), ('people', 1), ('manage', 1), ('studied', 1), ('application', 1), ('notforprofit', 1)]
# d. print the word only
print(data.map(lambda wc: wc[1]).collect())
['the', 'managers', 'and', 'management', 'organization', 'administration', 'organizations', 'are', 'business', 'provide', 'middle', 'roles', 'senior', 'direction', 'frontline', 'who', 'such', 'resources', 'larger', 'may', 'there', 'strategic', 'volunteers', 'social', 'work', 'master', 'public', 'smaller', 'its', 'some', 'which', 'goals', 'also', 'employees', 'leadership', 'human', 'bcom', 'even', 'perform', 'set', 'than', 'several', 'levels', 'science', 'officer', 'ceo', 'typically', 'large', 'becoming', 'commerce', 'pyramid', 'examples', 'voluntary', 'objectives', 'leaders', 'make', 'manager', 'financial', 'chief', 'overall', 'researchers', 'have', 'more', 'efforts', 'operate', 'degrees', 'communicate', 'term', 'professors', 'would', 'three', 'include', 'refer', 'team', 'bachelor', 'doctor', 'whose', 'individuals', 'sector', 'academic', 'discipline', 'lower', 'scopes', 'board', 'mpa', 'regional', 'them', 'executive', 'organizational', 'regular', 'coordinating', 'oversee', 'activities', 'dba', 'commonly', 'important', 'can', 'how', 'for', 'all', 'wider', 'natural', 'complete', 'president', 'observed', 'managing', 'report', 'available', 'many', 'supervisors', 'government', 'colleges', 'will', 'degree', 'structure', 'strategy', 'branch', 'phd', 'hierarchical', 'includes', 'accomplish', 'much', 'organized', 'aim', 'body', 'directors', 'mba', 'study', 'section', 'decisions', 'through', 'setting', 'generally', 'their', 'objective', 'universities', 'whether', 'technological', 'ones', 'people', 'manage', 'studied', 'application', 'notforprofit']
# e. print the total number of occurrences per first character, counting only words with more than 5 occurrences.
# The result is sorted by count ASC
data.filter(lambda cw: cw[0] > 5).map(lambda cw: (cw[1][0], cw[0])).reduceByKey(lambda a, b: a + b).map(lambda cw: (cw[1], cw[0])).sortByKey().collect()
[(9, 'o'), (10, 'a'), (22, 'm'), (22, 't')]
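The result of step 7e can be checked by hand. The sketch below replays the same filter, first-letter mapping, summation and ascending sort in plain Python, using the six most frequent (count, word) pairs from the output of step 6; all remaining words have lower counts and would be filtered out as well.

```python
from collections import defaultdict

# (count, word) pairs taken from the top of the output of step 6.
top = [(22, "the"), (13, "managers"), (10, "and"),
       (9, "management"), (9, "organization"), (5, "administration")]

# Keep words with more than 5 occurrences and sum the counts per first letter.
totals = defaultdict(int)
for count, word in top:
    if count > 5:
        totals[word[0]] += count

# Swap to (total, letter) and sort ascending, as sortByKey() does by default.
result = sorted((total, letter) for letter, total in totals.items())
print(result)  # [(9, 'o'), (10, 'a'), (22, 'm'), (22, 't')]
```

The value 22 for 'm' is 13 ("managers") plus 9 ("management"), so the figures are summed occurrences rather than the number of distinct words per letter.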
8.
a. Read the file, generate words, count each word
b. store the word splits in a file
c. store the word counts in a file
# a. generate words, trim and lowercase, remove special characters, drop words shorter than three characters, emit (word, 1) pairs
import re
splits = (word_rdd
.flatMap(lambda line: line.split(" "))
.map(lambda word: word.strip().lower())
.map(lambda word: re.sub("[,.:;'\"\\?\\-!\\(\\)]", "", word))
.filter(lambda word: len(word) > 2)
.map(lambda word: (word, 1))
)
# b. store the word splits in a file
splits.coalesce(1).saveAsTextFile('splitoutput')
counts = splits.reduceByKey(lambda a, b: a + b)
# c. store the word counts in a file
counts.coalesce(1).saveAsTextFile('countoutput')
spark.stop()
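To read the saved counts back outside Spark, one option is to parse each saved line as a Python literal. This is a sketch under the assumption that saveAsTextFile wrote the string form of each tuple, one per line, into part files inside the 'countoutput' directory; the sample lines below are hypothetical stand-ins for such a file's contents.

```python
import ast

# Hypothetical lines as they might appear in a part file under 'countoutput'.
saved_lines = ["('management', 9)", "('managers', 13)", "('the', 22)"]

# literal_eval safely turns "('the', 22)" back into the tuple ('the', 22).
counts = dict(ast.literal_eval(line) for line in saved_lines)
print(counts["managers"])  # 13
```

Keep in mind that saveAsTextFile creates a directory rather than a single file and fails if that directory already exists, so 'splitoutput' and 'countoutput' need to be removed before the notebook is run again.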