Provision 4 boxes
There are a few options to achieve this:
- Use Docker to simulate the cluster – for testing, I like this option best!
- Use AWS Spot Instances (much cheaper)
- Use physical boxes
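For the Docker option, here is a sketch of what provisioning four "boxes" might look like (the image, container names, and network name are placeholders, not from this guide):

```shell
# hypothetical: 4 Ubuntu containers on one bridge network, acting as nodes
docker network create spark-net
for i in 1 2 3 4; do
  docker run -d --name node$i --hostname node$i \
    --network spark-net ubuntu:14.04 sleep infinity
done
docker ps --filter network=spark-net   # list the 4 "nodes"
```

Each container then gets Java, Hadoop, and Spark installed just like a physical node would.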
Install Hadoop and Java
Check out DevOps – Setup Hadoop Cluster for the steps.
Install Spark
Spark Configuration on All Nodes
```shell
# make sure Java 8 is installed
allnodes$ sudo apt-get install openjdk-8-jdk
allnodes$ java -version

# install Scala
allnodes$ sudo apt-get install scala
allnodes$ scala -version

# install Spark 1.4.1 on all the nodes: save the binary tar file to ~/Downloads
# and extract it to /usr/local
allnodes$ wget http://apache.mirrors.tds.net/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz -P ~/Downloads
allnodes$ sudo tar zxvf ~/Downloads/spark-* -C /usr/local
allnodes$ sudo mv /usr/local/spark-* /usr/local/spark

# set up the environment
allnodes$ vi ~/.profile
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

# source the profile
allnodes$ . ~/.profile

# change the ownership of the $SPARK_HOME directory to the user ubuntu
allnodes$ sudo chown -R ubuntu $SPARK_HOME

# common Spark configuration on all nodes
allnodes$ cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
allnodes$ vi $SPARK_HOME/conf/spark-env.sh
#!/usr/bin/env bash
export JAVA_HOME=/usr
export SPARK_PUBLIC_DNS="current_node_public_dns"
export SPARK_WORKER_CORES=6
```
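Note that SPARK_PUBLIC_DNS has to be edited by hand on every node. If the nodes are EC2 instances (as in the Spot Instance option), that value can instead be filled in automatically; a sketch of such a spark-env.sh fragment, assuming the standard EC2 instance metadata endpoint:

```shell
# spark-env.sh fragment (sketch): derive SPARK_PUBLIC_DNS from EC2
# instance metadata; fall back to the local hostname when not on EC2
export SPARK_PUBLIC_DNS=$(curl -s --max-time 2 \
  http://169.254.169.254/latest/meta-data/public-hostname || hostname -f)
```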
Spark Master Specific Configurations
```shell
# create a slaves file that contains the public DNS names of all the Spark Worker nodes
spark_master_node$ sudo vi $SPARK_HOME/conf/slaves
spark_worker1_public_dns
spark_worker2_public_dns
spark_worker3_public_dns

# start your Spark cluster
spark_master_node$ $SPARK_HOME/sbin/start-all.sh
```
You can open http://spark_master_public_dns:8080 in your browser to check that all Worker nodes are online.
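Besides the browser check, the standalone master also serves its status as JSON; as a sketch (assuming the default UI port 8080), the number of registered workers can be checked from the shell:

```shell
# scripted check: the master web UI also exposes cluster state as JSON at /json
spark_master_node$ curl -s http://localhost:8080/json | python -c \
'import json, sys; print(len(json.load(sys.stdin)["workers"]), "workers registered")'
```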
Install Jupyter and Run an RDD
```shell
# install Jupyter on the Spark Master
spark_master_node$ sudo apt-get install python-dev python-pip python-numpy python-scipy python-pandas gfortran
spark_master_node$ sudo pip install nose "ipython[notebook]"

# set up keys to access S3
spark_master_node$ vi ~/.profile
export AWS_ACCESS_KEY_ID=aws_access_key_id
export AWS_SECRET_ACCESS_KEY=aws_secret_access_key

# source the profile
spark_master_node$ . ~/.profile

# start the Jupyter notebook server
spark_master_node$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://spark_master_hostname:7077 --executor-memory 6400M --driver-memory 6400M

# port forwarding to access the remote Jupyter server; use port 7776 on your local computer
local_computer$ ssh -N -f -L localhost:7776:localhost:7777 ubuntu@spark_master_public_dns

# access Jupyter at http://localhost:7776
```
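To actually run an RDD against the cluster, a minimal smoke job can be submitted; this is a sketch with a hypothetical file name, using the master URL from above, and the same code can be pasted into a notebook cell (where `sc` already exists):

```shell
# write a tiny PySpark job to a file and submit it to the standalone master
spark_master_node$ cat > /tmp/rdd_smoke.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="rdd-smoke")
rdd = sc.parallelize(range(1000), 6)   # 6 partitions, one per worker core
print(rdd.map(lambda x: x * x).sum())  # 332833500 when the cluster is healthy
sc.stop()
EOF
spark_master_node$ spark-submit --master spark://spark_master_hostname:7077 /tmp/rdd_smoke.py
```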