Provision 4 boxes
There are a few options to achieve this:
- Use Docker to simulate the cluster – for testing, I like this option best!
- Use AWS Spot Instances (much cheaper)
- Use physical boxes
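For the Docker option, here is a sketch of what provisioning four "boxes" might look like (the image, container names, and network name are placeholders, not from this guide):

```shell
# hypothetical: 4 Ubuntu containers on one bridge network, acting as nodes
docker network create spark-net
for i in 1 2 3 4; do
  docker run -d --name node$i --hostname node$i \
    --network spark-net ubuntu:14.04 sleep infinity
done
docker ps --filter network=spark-net   # list the 4 "nodes"
```

Each container then gets Java, Hadoop, and Spark installed just like a physical node would.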
Install Hadoop and Java
Check out DevOps – Setup Hadoop Cluster for the steps.
Install Spark
Spark Configuration on All Nodes
```shell
# make sure Java 8 is installed
allnodes$ sudo apt-get install openjdk-8-jdk
allnodes$ java -version

# install Scala
allnodes$ sudo apt-get install scala
allnodes$ scala -version

# install Spark 1.4.1 on all the nodes: save the binary tar file to ~/Downloads
# and extract it to /usr/local
allnodes$ wget http://apache.mirrors.tds.net/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz -P ~/Downloads
allnodes$ sudo tar zxvf ~/Downloads/spark-* -C /usr/local
allnodes$ sudo mv /usr/local/spark-* /usr/local/spark

# set up the environment
allnodes$ vi ~/.profile
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

# source the profile
allnodes$ . ~/.profile

# change the ownership of the $SPARK_HOME directory to the user ubuntu
allnodes$ sudo chown -R ubuntu $SPARK_HOME

# common Spark configuration on all nodes
allnodes$ cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
allnodes$ vi $SPARK_HOME/conf/spark-env.sh
#!/usr/bin/env bash
export JAVA_HOME=/usr
export SPARK_PUBLIC_DNS="current_node_public_dns"
export SPARK_WORKER_CORES=6
```
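Note that SPARK_PUBLIC_DNS has to be edited by hand on every node. If the nodes are EC2 instances (as in the Spot Instance option), that value can instead be filled in automatically; a sketch of such a spark-env.sh fragment, assuming the standard EC2 instance metadata endpoint:

```shell
# spark-env.sh fragment (sketch): derive SPARK_PUBLIC_DNS from EC2
# instance metadata; fall back to the local hostname when not on EC2
export SPARK_PUBLIC_DNS=$(curl -s --max-time 2 \
  http://169.254.169.254/latest/meta-data/public-hostname || hostname -f)
```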
Spark Master Specific Configurations
```shell
# create a slaves file that contains the public DNS names of all the Spark Worker nodes
spark_master_node$ sudo vi $SPARK_HOME/conf/slaves
spark_worker1_public_dns
spark_worker2_public_dns
spark_worker3_public_dns

# start your Spark cluster
spark_master_node$ $SPARK_HOME/sbin/start-all.sh
```
You can open http://spark_master_public_dns:8080 in your browser to check that all Worker nodes are online.
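Besides the browser check, the standalone master also serves its status as JSON; as a sketch (assuming the default UI port 8080), the number of registered workers can be checked from the shell:

```shell
# scripted check: the master web UI also exposes cluster state as JSON at /json
spark_master_node$ curl -s http://localhost:8080/json | python -c \
'import json, sys; print(len(json.load(sys.stdin)["workers"]), "workers registered")'
```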
Install Jupyter and Run an RDD
```shell
# install Jupyter on the Spark Master
spark_master_node$ sudo apt-get install python-dev python-pip python-numpy python-scipy python-pandas gfortran
spark_master_node$ sudo pip install nose "ipython[notebook]"

# set up keys to access S3
spark_master_node$ vi ~/.profile
export AWS_ACCESS_KEY_ID=aws_access_key_id
export AWS_SECRET_ACCESS_KEY=aws_secret_access_key

# source the profile
spark_master_node$ . ~/.profile

# start the Jupyter notebook server
spark_master_node$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://spark_master_hostname:7077 --executor-memory 6400M --driver-memory 6400M

# port forwarding to access the remote Jupyter server; use port 7776 on your local computer
local_computer$ ssh -N -f -L localhost:7776:localhost:7777 ubuntu@spark_master_public_dns

# access Jupyter at http://localhost:7776
```
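To actually run an RDD against the cluster, a minimal smoke job can be submitted; this is a sketch with a hypothetical file name, using the master URL from above, and the same code can be pasted into a notebook cell (where `sc` already exists):

```shell
# write a tiny PySpark job to a file and submit it to the standalone master
spark_master_node$ cat > /tmp/rdd_smoke.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="rdd-smoke")
rdd = sc.parallelize(range(1000), 6)   # 6 partitions, one per worker core
print(rdd.map(lambda x: x * x).sum())  # 332833500 when the cluster is healthy
sc.stop()
EOF
spark_master_node$ spark-submit --master spark://spark_master_hostname:7077 /tmp/rdd_smoke.py
```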