#!/usr/bin/env python
# coding: utf-8
# ![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)
# # **Simple example with Spark**
#
# This notebook illustrates the use of [Spark](https://spark.apache.org) in [SWAN](http://swan.web.cern.ch).
#
# The current setup allows you to execute [PySpark](http://spark.apache.org/docs/latest/api/python/) operations on a local standalone Spark instance. This is useful for testing with small datasets.
#
# In the future, SWAN users will be able to attach external Spark clusters to their notebooks, so they can process bigger datasets. Moreover, a Scala Jupyter kernel will be added so that Spark can also be used from Scala.
# ## Import the necessary modules
# The `pyspark` module is available, so we can perform the necessary imports directly.
# In[1]:
from pyspark import SparkContext
# ## Create a `SparkContext`
# A `SparkContext` needs to be created before running any Spark operation. This context is linked to the local Spark instance.
# In[2]:
sc = SparkContext()
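# As an optional variant (a minimal sketch, not part of the setup described above), the context can also be created with an explicit configuration. Only one `SparkContext` can be active at a time, so the sketch uses `getOrCreate`, which returns the already-running context if one exists. The master URL and application name below are illustrative values.
# In[ ]:
from pyspark import SparkConf
# Illustrative configuration: run locally using all available cores.
conf = SparkConf().setMaster("local[*]").setAppName("SimpleExample")
sc = SparkContext.getOrCreate(conf)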
# ## Run Spark actions and transformations
# Let's use our `SparkContext` to parallelize a list.
# In[13]:
rdd = sc.parallelize([1, 2, 4, 8])
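# As an aside (a small sketch beyond the example above), `parallelize` also accepts the number of partitions to split the data into, and `getNumPartitions` lets us inspect the result. The name `rdd2` is just illustrative.
# In[ ]:
# Explicitly request two partitions when parallelizing the list.
rdd2 = sc.parallelize([1, 2, 4, 8], numSlices=2)
rdd2.getNumPartitions()  # -> 2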
# We can count the number of elements in the RDD.
# In[14]:
rdd.count()
# Let's now `map` a function over our RDD to increment all its elements.
# In[15]:
rdd.map(lambda x: x + 1).collect()
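# Transformations can also be chained before collecting. As a small additional sketch, the cell below keeps only the even elements and then scales them.
# In[ ]:
# Chain a filter and a map, then materialize the result with collect().
rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).collect()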
# We can also calculate the sum of all the elements with `reduce`.
# In[16]:
rdd.reduce(lambda x, y: x + y)
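# As an optional closing step, we can stop the context to release the local Spark instance once we are done.
# In[ ]:
# Shut down the SparkContext and free its resources.
sc.stop()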