Provision AWS EC2 cluster with Spark version 2.x

This article shows how to provision an EC2 cluster with Spark and Hadoop. By the end, one should be able to run Spark applications that use the HDFS file system.

Table of contents

  • Quick and dirty
  • Set up the Spark cluster on EC2
  • Other cluster operations
  • Some useful settings on the Spark cluster
  • Run a Spark application on the Spark cluster
  • Alternatives

Quick and dirty

For the impatient reader, just run the following script. Remember to replace the access key and the key pair with your own.

export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
/Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
        --region=eu-west-1 \
        --instance-type=t2.micro \
        -s 20 \
        --hadoop-major-version=2 \
        launch spark-cluster

Set up the Spark cluster on EC2

  1. Amazon AWS account

    Naturally, an Amazon AWS account is needed in order to use the EC2 service.

  2. Access key

    The access key allows applications to communicate with the EC2 servers.

    • From the AWS front page, click your username in the top right corner, choose My Security Credentials, then Access keys, and click Create access key.

    • Store the access key ID and the secret access key in a file, for example in the standard AWS credentials file format shown below.
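
    One common convention is to keep the keys in the AWS credentials file at ~/.aws/credentials. Note that spark-ec2 itself reads the environment variables exported in step 4, so this file mainly serves as a safe place to keep the keys:

      # ~/.aws/credentials
      [default]
      aws_access_key_id = ABCAJKKNPJJRL74RPY4A
      aws_secret_access_key = U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT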

  3. Key pair

    The key pair essentially authenticates applications and scripts with the EC2 servers.

    • From the front page, choose the EC2 service in the top left corner, click Key Pairs, then choose Create Key Pair.

    • Name the key pair following some pattern, e.g. username+region, so that key pairs from different regions do not get mixed up.

    • Download the file and save it as a .pem file.

    • Change the permissions so that only the user can read the file:

      chmod 400 keypairfile.pem
      
  4. Export the access key

    • Export the access key with the following commands, using the key ID and secret generated earlier:

      export AWS_SECRET_ACCESS_KEY=U/y3rO1/wwwzbyUe6wkzNwVG9Qb3uBdxBqiHsmcT
      export AWS_ACCESS_KEY_ID=ABCAJKKNPJJRL74RPY4A
      
  5. Setup a Spark cluster using spark-ec2

    • The spark-ec2 package is no longer part of the Spark distribution, so we need to download it from its GitHub repository:
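
      git clone https://github.com/amplab/spark-ec2.git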

    • Set up a Spark cluster with the following command, naming the cluster spark-cluster.

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem \
      --region=eu-west-1 \
      --instance-type=t2.micro \
      -s 5 \
      --hadoop-major-version=2 \
      launch spark-cluster
      
      • Specify the key pair file via option -i

      • Specify the key name via option -k

      • Give the number of Spark slave nodes via option -s

      • Specify the Hadoop version via option --hadoop-major-version

    • The Spark version can also be specified via additional options to spark-ec2. Unfortunately, I haven't figured out a good way to automatically build the Spark package on the master node.

      --spark-version=a2c7b2133cfee7fa9abfaa2bfbfb637155466783 \
      --spark-git-repo=https://github.com/apache/spark \
      

Other cluster operations

  1. Log in to Spark cluster

    /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster
    
  2. Stop Spark cluster

    /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 stop spark-cluster
    
  3. Restart Spark cluster

    /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 start spark-cluster
    
  4. Destroy Spark cluster

    /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 destroy spark-cluster
    
  5. UI

    • The Spark UI is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:8080
    • The cluster UI (Ganglia) is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:5080/ganglia/
    • Once the history server has been started as described in the next section, the event history server is available at e.g. http://ec2-54-246-255-51.eu-west-1.compute.amazonaws.com:18080
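    • If the master's hostname is not at hand, spark-ec2 can print it with its get-master action:

      /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 get-master spark-cluster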

Some useful settings on the Spark cluster

  1. Set up the event history log

    • Make a directory for event logs

      cd ~
      mkdir /tmp/spark-events
      
    • Start the Spark event history server and restart the Spark engine

      cd ~/spark/sbin
      ./start-history-server.sh
      ./stop-all.sh; ./start-all.sh
      
    • Run a Spark application with the event log enabled, e.g. with PySpark:

      spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
      
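    • Instead of passing --conf on every run, the event-log settings can also be persisted in spark/conf/spark-defaults.conf. A minimal sketch, assuming the history server's default log directory /tmp/spark-events:

      # spark/conf/spark-defaults.conf
      spark.eventLog.enabled           true
      spark.eventLog.dir               file:///tmp/spark-events
      spark.history.fs.logDirectory    file:///tmp/spark-events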

Run a Spark application on the Spark cluster

Here I run a basic PySpark word-count example on the complete works of Shakespeare.

  1. Log in to the master node of the Spark cluster

    /Users/hongyusu/Codes/Packages/spark-ec2/spark-ec2 -k g1euwest -i g1euwest.pem --region=eu-west-1 login spark-cluster
    

    or

    ssh -i "g1euwest.pem" root@ec2-54-246-255-51.eu-west-1.compute.amazonaws.com
    
  2. Download and preprocess the data

    cd ~
    mkdir tmp; cd tmp
    wget https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt  
    cat t8.shakespeare.txt | sed 's/ /\n/g' > data.txt
    
  3. Move data to HDFS

    ephemeral-hdfs/bin/hadoop dfs -mkdir /data/
    ephemeral-hdfs/bin/hadoop dfs -put ~/tmp/data.txt /data/
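
    To verify that the file landed in HDFS, list the directory with the same client:

    ephemeral-hdfs/bin/hadoop dfs -ls /data/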
    
  4. Run the PySpark word-count example with event history logging enabled.

    spark/bin/pyspark --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt
    
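    The same example can also be submitted with spark-submit, the standard way to run a script rather than an interactive shell:

    spark/bin/spark-submit --conf "spark.eventLog.enabled=true" spark/examples/src/main/python/wordcount.py /data/data.txt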

Alternatives

  1. We could also look into Flintrock. I tried it, and it seems faster than spark-ec2; a sketch of its launch command is shown below.
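
    A minimal Flintrock launch roughly equivalent to the spark-ec2 command above might look like this (a sketch based on Flintrock's documented CLI; the Spark version here is an example, and flag names may change between releases):

      # example values; replace the key pair, region, and Spark version with your own
      flintrock launch spark-cluster \
          --num-slaves 5 \
          --spark-version 2.2.0 \
          --ec2-key-name g1euwest \
          --ec2-identity-file g1euwest.pem \
          --ec2-instance-type t2.micro \
          --ec2-region eu-west-1
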
Hongyu Su 24 September 2017 Helsinki