Create an AWS account¶
Create an IAM user¶
- Log in (or create a new account) and click on the Identity & Access Management link under the Security & Identity section.
- Select Users on the menu on the left and click on the Create New Users button.
- Use cs109 for the user name and make sure the Generate an access key for each user checkbox is selected.
- Click the Download Credentials button to get the new credential keys. Once the file is downloaded, click the Close button.
- Click on the newly created cs109 user.
- Scroll down until you see the Attach Policy button and click on it.
- Search and select the AdministratorAccess policy, and click the Attach Policy button.
- Install the AWS command line interface: go to http://docs.aws.amazon.com/cli/latest/userguide/installing.html and follow the instructions for the platform you're using.
- Run the following on the command line:
aws configure
- Fill out the requested information (replace the ??? below with the values from the credentials file):
AWS Access Key ID [None]: ???
AWS Secret Access Key [None]: ???
Default region name [None]: us-east-1
Default output format [None]: json
- Run the following on the command line:
aws emr create-default-roles
You should get a big JSON string as the output of this command.¶
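Before moving on, it can help to confirm that the keys entered in aws configure are really the ones being used. The sketch below wraps aws sts get-caller-identity in a small function. check_aws_user is a hypothetical helper name, not part of the course material, and it assumes the IAM user is called cs109 as in the steps above.

```shell
# Hypothetical helper: confirm the CLI is authenticated as the expected IAM user.
# Assumes the user created above is named cs109 (override with the first argument).
check_aws_user() {
  expected_user="${1:-cs109}"
  # Ask AWS which identity the configured keys belong to.
  arn=$(aws sts get-caller-identity --query 'Arn' --output text) || {
    echo "Could not call AWS; check the keys entered in 'aws configure'." >&2
    return 1
  }
  # An IAM user ARN ends in user/<name>.
  case "$arn" in
    */"$expected_user") echo "OK: $arn" ;;
    *) echo "Unexpected identity: $arn" >&2; return 1 ;;
  esac
}

# Usage: check_aws_user
```

If the function reports an unexpected identity, re-running aws configure with the values from the downloaded credentials file is the likely fix.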
Create an EC2 SSH key pair¶
- Run the following on the command line:
aws ec2 create-key-pair --key-name CS109 --query 'KeyMaterial' --output text > CS109.pem
chmod 400 CS109.pem
You should now have a file called CS109.pem in your current directory.¶
Create the Spark cluster¶
- Run the following on the command line:
export CLUSTER_ID=`aws emr create-cluster --name "CS109 Spark cluster" \
--release-label emr-4.1.0 --applications Name=Spark --ec2-attributes KeyName=CS109 \
--instance-type m1.large --instance-count 3 --use-default-roles \
--bootstrap-actions Path=s3://cs109-2015/install-anaconda-emr,Name=Install_Anaconda \
--query 'ClusterId' --output text` && echo $CLUSTER_ID
The output of this command will be something like the following (your actual value will be different):
j-33S87OUETACNK
It will take a few minutes for your cluster to be ready. Go watch some cat videos on YouTube and come back in 20 minutes or so.
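One thing to watch for: export succeeds even when the command inside the backticks fails, so an empty CLUSTER_ID can slip through unnoticed. A minimal sketch of a sanity check follows. is_cluster_id is a hypothetical helper, and the assumption that IDs start with j- is based on the example value shown above.

```shell
# Hypothetical helper: succeed only if the argument looks like an EMR cluster ID.
# Assumption: cluster IDs have the form j-XXXXXXXX, as in the example output.
is_cluster_id() {
  case "$1" in
    j-?*) return 0 ;;   # "j-" followed by at least one character
    *)    return 1 ;;   # empty, an error message, or anything else
  esac
}

# Usage:
#   is_cluster_id "$CLUSTER_ID" || echo "Cluster creation may have failed" >&2
```

If the check fails, run the aws emr create-cluster command again without the export wrapper to see the actual error message.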
Check cluster status¶
- Run the following on the command line:
aws emr describe-cluster --cluster-id $CLUSTER_ID --query 'Cluster.Status.State' --output text
If the output is anything but WAITING, your cluster is not ready yet. Wait a few more minutes and run the command again. Do not proceed until the cluster is ready.
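Instead of re-running the command by hand, the check can be put in a loop. The following is only a sketch: wait_for_cluster is a hypothetical helper, and the poll interval and maximum number of polls are arbitrary defaults. It stops early if the cluster ends up in a terminated state, since it will never reach WAITING after that.

```shell
# Hypothetical helper: poll the cluster state until it is WAITING.
# Arguments: cluster ID, seconds between polls (default 30), max polls (default 60).
# Returns 0 when ready, 1 if the cluster terminated, 2 on CLI error, 3 on timeout.
wait_for_cluster() {
  cluster_id="$1"
  interval="${2:-30}"
  max_polls="${3:-60}"
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    state=$(aws emr describe-cluster --cluster-id "$cluster_id" \
      --query 'Cluster.Status.State' --output text) || return 2
    echo "state: $state"
    case "$state" in
      WAITING) return 0 ;;
      TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 3
}

# Usage: wait_for_cluster "$CLUSTER_ID" && echo "Cluster is ready"
```

With the defaults the loop gives up after about 30 minutes, which is comfortably longer than the startup time mentioned above.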
Connect to the iPython Notebook:¶
Make sure the cluster is ready before proceeding.¶
- Get the cluster master's public DNS name:
export DNS_NAME=`aws emr describe-cluster --cluster-id $CLUSTER_ID \
--query 'Cluster.MasterPublicDnsName' --output text` && echo $DNS_NAME
- Run the script to configure Spark:
ssh -o ServerAliveInterval=10 -i CS109.pem hadoop@$DNS_NAME 'sh configure-spark.sh'
- Create an SSH tunnel to the AWS box (this assumes your SSH key is in the same directory you are invoking the SSH command from).
ssh -o ServerAliveInterval=10 -i CS109.pem hadoop@$DNS_NAME -L 8989:localhost:8888
- The previous command will create an SSH connection to the Spark cluster and a tunnel to access the notebook. Run the following command in the SSH session:
pyspark
- Open your browser and go to http://localhost:8989
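The -L 8989:localhost:8888 option forwards port 8989 on your machine to port 8888 on the cluster master, which is why the notebook shows up at http://localhost:8989. If the cluster was not ready when DNS_NAME was set, the variable may be empty or hold the placeholder None, and the SSH command then fails with a confusing error. Here is a sketch of a guarded version of the tunnel command. open_notebook_tunnel is a hypothetical helper, and the None check is an assumption about how the CLI prints a missing value with --output text.

```shell
# Hypothetical helper: open the SSH tunnel only if the DNS name looks usable.
# Arguments: master DNS name, local port (default 8989).
# Assumes CS109.pem is in the current directory, as in the steps above.
open_notebook_tunnel() {
  dns_name="$1"
  local_port="${2:-8989}"
  # Refuse to continue with an empty name or the "None" placeholder.
  if [ -z "$dns_name" ] || [ "$dns_name" = "None" ]; then
    echo "Master DNS name is not set; is the cluster ready?" >&2
    return 1
  fi
  # Forward local_port on this machine to port 8888 on the master.
  ssh -o ServerAliveInterval=10 -i CS109.pem "hadoop@$dns_name" \
    -L "$local_port:localhost:8888"
}

# Usage: open_notebook_tunnel "$DNS_NAME"
```

Passing a different second argument is handy when port 8989 is already taken on your machine; the browser URL changes to match.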
Do your thing...¶
Terminate the Spark cluster¶
- Press CTRL-C twice to terminate iPython.
- Type exit on the command line to exit the SSH session.
- Run the following on the command line:
aws emr terminate-clusters --cluster-ids $CLUSTER_ID
- Optionally, run the following two commands to remove your keys. This is only needed if you don't plan on creating new clusters.
aws ec2 delete-key-pair --key-name CS109
rm CS109.pem
Make sure the cluster was terminated¶
- Run the following on the command line:
aws emr describe-cluster --cluster-id $CLUSTER_ID --query 'Cluster.Status.State' --output text
If the output is anything other than TERMINATING or TERMINATED, re-run the terminate command above or go to the AWS console and terminate the cluster manually.
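If you opened a new terminal, or created more than one cluster along the way, CLUSTER_ID may not cover everything that is still running. As a rough double check, the sketch below lists every active cluster in your default region. report_active_clusters is a hypothetical helper, and treating empty output or None as "nothing running" is an assumption about the CLI's text output.

```shell
# Hypothetical helper: report any EMR clusters that are still active.
# Returns 0 if none are active, 1 if some are, 2 if the CLI call fails.
report_active_clusters() {
  active=$(aws emr list-clusters --active \
    --query 'Clusters[].[Id,Name,Status.State]' --output text) || return 2
  # Assumption: no active clusters shows up as empty output (or "None").
  if [ -z "$active" ] || [ "$active" = "None" ]; then
    echo "No active clusters."
    return 0
  fi
  echo "Still active:"
  echo "$active"
  return 1
}

# Usage: report_active_clusters
```

A cluster that is still shutting down shows up here with the state TERMINATING; anything else in the list needs to be terminated.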
YOU MUST TERMINATE THE CLUSTER OR YOU WILL BE CHARGED!!!¶