This article describes how to provision an EC2 cluster with Spark and Hadoop. As a result, one should be able to run Spark applications that use the HDFS file system.
For the impatient reader, run the following script as a quick-and-dirty setup. Remember to replace the access key and key pair with your own.
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
--region=eu-west-1 \
--instance-type=t2.micro \
-s 20 \
--hadoop-major-version=2 \
launch spark-cluster
Amazon AWS account
You need an Amazon AWS account in order to use EC2 services.
Access key
An access key allows applications to communicate with the EC2 servers.
From the AWS front page, click your username in the top-right corner, choose My Security Credentials, go to Access keys, and click Create access key.
Store the access key ID and secret access key in a file, for example as sketched below.
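A minimal sketch of keeping them in the standard ~/.aws/credentials file (the values here are the placeholder keys used throughout this article; replace them with your own). Note that spark-ec2 itself reads the keys from environment variables, as shown in the export step below, so this file is just a convenient place to keep them.
# store the credentials in the standard AWS credentials file
# (placeholder values -- replace with your own keys)
mkdir -p ~/.aws
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = ABCAJKKNPJJRL74RPY4A
aws_secret_access_key = U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
EOF
chmod 600 ~/.aws/credentials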
Key pair
The key pair essentially authenticates your applications/scripts with the EC2 servers.
From the front page, choose the EC2 service in the top-left corner, click Key Pairs, then choose Create Key Pair.
Name the key pair following some pattern, e.g. username+region, so that key pairs from different regions do not get mixed up.
Download the file and save it as a .pem file.
Change the permission so that only the user can read it:
chmod 400 keypairfile.pem
Export access key
Export the access key with the following commands, using the key ID and secret key generated in the previous step.
export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
Set up a Spark cluster using spark-ec2
The spark-ec2 package is no longer part of the Spark distribution, so we need to download it from its GitHub repository.
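A minimal sketch of fetching it, assuming the scripts live in the amplab/spark-ec2 repository on GitHub, cloned into a path matching the commands below:
# clone the spark-ec2 scripts (repository location is an assumption)
git clone https://github.com/amplab/spark-ec2.git ~/Codes/Packages/spark-ec2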
Set up a Spark cluster with the following command, naming the cluster spark-cluster.
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
--region=eu-west-1 \
--instance-type=t2.micro \
-s 5 \
--hadoop-major-version=2 \
launch spark-cluster
Specify the key pair file via option -i
Specify the key pair name via option -k
Specify the number of Spark slave nodes via option -s
Specify the Hadoop major version via option --hadoop-major-version
The Spark version can also be specified via additional options to spark-ec2, for example a specific commit from a Git repository. Unfortunately, I haven't figured out a good way to automatically build the Spark package on the master node.
--spark-version=a2c7b2133cfee7fa9abfaa2bfbfb637155466783 \
--spark-git-repo=https://github.com/apache/spark \
Log in to Spark cluster
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster
Stop Spark cluster
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 stop spark-cluster
Restart Spark cluster
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 start spark-cluster
Destroy Spark cluster
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 destroy spark-cluster
UI
Set up event history log
Make a directory for event logs
cd ~
mkdir /tmp/spark-events
Start the Spark event history server and restart the Spark engine
cd ~/spark/sbin
./start-history-server.sh
./stop-all.sh ;./start-all.sh
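The history server web UI should then be reachable on the master node; a quick sanity check, assuming Spark's default history server port of 18080:
# verify the history server is up (18080 is Spark's default history server port)
curl -s http://localhost:18080 | head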
Run a Spark application with event logging enabled, e.g. PySpark:
spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
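Instead of passing --conf on every run, the event log settings can also be placed in spark/conf/spark-defaults.conf. A sketch, assuming the /tmp/spark-events directory created above:
# persist event log settings so applications and the history server pick them up
cat >> ~/spark/conf/spark-defaults.conf <<'EOF'
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events
EOF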
Then I run a basic PySpark word count example on the works of Shakespeare.
Log in to the master node of the Spark cluster
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster
or
ssh -i "g1euwest.pem" root@ec2-54-246-255-51.eu-west-1.compute.amazonaws.com
Download and preprocess the data
cd ~
mkdir tmp; cd tmp
wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
cat t8.shakespeare.txt | sed 's/ /\n/g' > data.txt
Move data to HDFS
ephemeral-hdfs/bin/hadoop dfs -mkdir /data/
ephemeral-hdfs/bin/hadoop dfs -put ~/tmp/data.txt /data/
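To double-check that the file landed in HDFS, list the target directory:
# confirm the upload succeeded
ephemeral-hdfs/bin/hadoop dfs -ls /data/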
Run the PySpark word count example with event history logging enabled.
spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
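If event logging is working, a new application log should appear under /tmp/spark-events on the master node, and the finished run then shows up in the history server UI:
# a new application log should appear after the run
ls -lt /tmp/spark-events | head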