Learning Spark

Introducing Spark

At a high level, Spark is a distributed programming model that lets you distribute tasks across the worker nodes of a cluster. Each worker node (i.e., executor) has a number of slots that can run multiple tasks in parallel. After a task finishes, the executor can store its result in a cache. This caching mechanism, which keeps the results of previous computations in memory, is why Spark significantly outperforms Hadoop: it doesn't need to persist all the data to disk between rounds of parallel processing. Even though Spark is relatively new, it is one of the hottest open-source technologies at the moment and has begun to surpass Hadoop's MapReduce model. This is partly because Spark's Resilient Distributed Dataset (RDD) model can do everything the MapReduce paradigm can, and more.

While the speed comes mainly from the smart use of cache on each worker node, the power of Spark is mainly attributed to its RDD model. An RDD is a logical reference to a dataset partitioned across many machines in the cluster. Because of that, the driver program doesn't need to load the physical data from other nodes; instead, the code, optimized into the form of a DAG, is shipped to all the worker nodes. By moving the code to the data, the volume of data transferred over the network is significantly reduced. This is an important paradigm shift for big data processing.

How Spark Executes Your Program

To better understand the model, let's walk through how it works, step by step:

  • As you enter your code in the Spark console, the driver program interprets it, creating RDDs and applying operators. The operators are either transformations or actions.
  • As the driver program executes, it builds up a graph (a DAG) whose nodes are RDDs and whose edges are transformation steps. However, nothing executes on the cluster until an action is encountered.
  • When the user runs an action (like collect), the graph is submitted to the DAG Scheduler, which divides the operator graph into (map and reduce) stages.
  • A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled into a single stage. This optimization is key to Spark's performance. The final output of the DAG scheduler is a set of stages.
  • The stages are passed on to the Task Scheduler, which launches tasks via the cluster manager (Spark Standalone, YARN, or Mesos). The task scheduler doesn't know about dependencies among stages.
  • The Worker executes the tasks. The worker knows only about the code that is passed to it.
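The key idea in the steps above is that transformations only record a plan and an action triggers execution. A minimal plain-Python sketch of that deferred-execution model (a toy class, not the real Spark API):

```python
# Toy model of Spark's deferred execution: transformations only record
# steps in a plan (the "DAG"); nothing runs until an action is called.
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data          # source data known to the "driver"
        self.ops = ops or []      # recorded transformation steps

    def map(self, f):             # transformation: returns a new node, no work done
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, pred):       # transformation: also just extends the plan
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):            # action: only now is the recorded plan executed
        result = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                result = [f(x) for x in result]
            else:
                result = [x for x in result if f(x)]
        return result

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; the action triggers it:
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the driver additionally splits this plan into stages and tasks per partition, but the transformation/action boundary works just as sketched here.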

RDD and its Properties

The Resilient Distributed Dataset (RDD) is a core component of the Apache Spark framework: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Let's break the name down:

  • Resilient: i.e., fault-tolerant. With the help of its lineage graph, an RDD can recover from failures.
  • Distributed: i.e., the data is spread across multiple nodes of a cluster.
  • Dataset: a collection of partitioned data.

An RDD is a collection of data with the following properties.

  • Immutable – an object whose state cannot be modified after it is created. Big data is immutable by default in the sense that people generally keep appending to it rather than updating it. Immutability buys us two things: parallel processing and caching. But if every change to an immutable dataset creates a new object, we could have a major memory problem. Sometimes we need to apply multiple transformations to an RDD; under strict immutability, each transformation would create a separate copy of the same data, which is very bad in big data terms and can lead to poor performance. This challenge is overcome by the next property of RDDs: lazy evaluation.
  • Lazy Evaluated – transformations are not computed until their results are needed. The issue with immutability discussed above arises when we apply multiple transformations on top of a single dataset, which would lead to multiple RDD copies; this is suppressed just by being lazy. Spark does not execute a transformation when it is requested; instead, it executes the accumulated list of transformations when the user asks for a result. This avoids creating multiple copies of the data and solves the challenge of immutability.
  • Type Inferred – Spark/Scala supports type inference, which helps address the challenges that come with lazy evaluation. Lazy transformation also lets Spark re-create cached data on failure using the lineage graph, which is how Spark achieves fault tolerance.
  • Cacheable – one of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores in memory any partitions of it that it computes and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
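The payoff of the caching property can be seen with a plain-Python sketch (not the Spark API): without a cache, every action re-runs the whole lineage; with one, the expensive work happens once.

```python
# Sketch of why caching matters: each action re-runs the lineage unless
# the intermediate result is materialized. We count invocations to show it.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # stand-in for a costly transformation
    return x * 2

data = [1, 2, 3]

# Without caching: two "actions" each re-apply the transformation.
uncached = lambda: [expensive(x) for x in data]
_ = uncached()               # first action: 3 calls
_ = uncached()               # second action recomputes: 3 more calls
assert calls["n"] == 6

# With caching (like rdd.cache()): compute once, reuse the stored result.
calls["n"] = 0
cached = [expensive(x) for x in data]   # computed once: 3 calls
_ = list(cached)             # first action reuses the cached partitions
_ = list(cached)             # second action reuses them too
assert calls["n"] == 3
```

In Spark the same trade-off appears whenever an RDD feeds multiple actions; calling cache() (or persist()) on it avoids recomputing the lineage each time.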

Data Shuffling

Although we ship the code to the worker servers where the data processing happens, data movement cannot be completely eliminated. For example, if the processing requires data residing in different partitions to be grouped together first, then we need to shuffle data among the worker servers. Spark carefully distinguishes between two types of transformations.

  • A narrow transformation is one whose processing logic depends only on data already residing in the partition, so no data shuffling is necessary. Examples of narrow transformations include filter(), sample(), map(), and flatMap().
  • A wide transformation is one whose processing logic depends on data residing in multiple partitions, so data shuffling is needed to bring it together in one place. Examples of wide transformations include groupByKey() and reduceByKey().

The primary goal when choosing an arrangement of operators is to reduce the number of shuffles and the amount of data shuffled. This is because shuffles are fairly expensive operations; all shuffle data must be written to disk and then transferred over the network. repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles.

Other Optimization Techniques

When data shuffling really cannot be avoided, we can use some optimization techniques to reduce its cost.

  • To join two big RDDs, apply a filter to each RDD before joining.
  • To join one big RDD with one small RDD, broadcast the small RDD, copying it to every partition, which avoids shuffling the big RDD. This is best suited to a small, static lookup table, because a broadcast variable must fit in memory and is immutable. If you need to change it, consider using an accumulator instead.
  • Avoid groupByKey when performing an associative reductive operation. For example, rdd.groupByKey().mapValues(_.sum) produces the same results as rdd.reduceByKey(_ + _), but the latter combines values within each partition before shuffling.
  • Avoid reduceByKey when the input and output value types are different.
  • Avoid the flatMap-join-groupBy pattern.
  • More to come later…
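The groupByKey-vs-reduceByKey advice above comes down to how much data crosses the shuffle. A plain-Python sketch of both strategies (toy partitions, not the Spark API) makes the difference concrete:

```python
# Why reduceByKey beats groupByKey for associative reductions: it can
# pre-aggregate inside each partition, so fewer records cross the network.
from collections import defaultdict

partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]

# groupByKey-style: every (key, value) pair is shipped to the shuffle.
shuffled_pairs = [pair for part in partitions for pair in part]  # 5 records moved

# reduceByKey-style: a "map-side combine" first sums per partition,
# then only one record per key per partition crosses the shuffle.
def combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

pre_combined = [combine(p) for p in partitions]
shuffled_small = [pair for part in pre_combined for pair in part]  # 4 records moved

# Both strategies yield the same final sums after the shuffle.
final = defaultdict(int)
for k, v in shuffled_small:
    final[k] += v
# final == {"a": 8, "b": 7}
```

The saving here is only one record, but with many duplicate keys per partition (the common case) the pre-combined shuffle can be orders of magnitude smaller.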

Installing Spark

  • brew install apache-spark
  • copy log4j.properties.template to log4j.properties under /usr/local/Cellar/apache-spark/2.1.0/libexec/conf
  • edit the log4j.properties file and change the log level from INFO to ERROR on log4j.rootCategory.
  • make sure you are using Python 2.7. If you are using Anaconda with Python 3.6 as your default, you can change the default so it points to Python 2.7 in .bash_profile like:
  • after that, you can run “pyspark” under /usr/local/Cellar/apache-spark/2.1.0 and test a few commands to verify that the setup is correct.
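The Python-version step above can be sketched as .bash_profile entries. The variable names are the real PySpark environment variables, but the interpreter path is a hypothetical example; adjust it to wherever Python 2.7 lives on your machine:

```shell
# Hypothetical .bash_profile entries — adjust the path to your own install.
# Point PySpark's workers and driver at a Python 2.7 interpreter
# instead of the Anaconda Python 3.6 default:
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
```

Remember to open a new shell (or `source ~/.bash_profile`) before launching pyspark so the variables take effect.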

Download DataSet

Download movie rating dataset (you can use the 100k sample below):

Run spark app

  • Find the most popular movie (name, count) – ref: popular-movie-nicer.py
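The referenced popular-movie-nicer.py does its counting with Spark RDDs; the core logic can be sketched in plain Python. This assumes the MovieLens 100k u.data format (tab-separated userID, movieID, rating, timestamp) and uses a tiny inline sample rather than the real file:

```python
# Count ratings per movie and pick the most-rated one.
# Assumes MovieLens u.data lines: "userID<TAB>movieID<TAB>rating<TAB>timestamp".
from collections import Counter

lines = [                      # toy stand-in for open("u.data")
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t242\t1\t878887116",
]

counts = Counter(line.split("\t")[1] for line in lines)
most_popular, n = counts.most_common(1)[0]
print(most_popular, n)  # movie 242, rated twice in this toy sample
```

In Spark the same shape is a map to extract the movie ID followed by a count per key; the "nicer" script presumably also joins the IDs against movie names, but the counting step is the part shown here.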

Run on Command prompt

Run on IntelliJ

Project SDK’s ClassPath

Environment Variables (Select Python > Edit Default)

After this, you can use IntelliJ to run and debug the Python Spark app:


Spark caches the data to be processed, allowing it to be up to 100 times faster than Hadoop. Spark uses Akka for multithreading, managing executor state, and scheduling tasks.

It uses Jetty to share files (JARs and other files), perform HTTP broadcast, and run the Spark Web UI. Spark is highly configurable and capable of utilizing components that already exist in the Hadoop ecosystem. This has allowed Spark to grow exponentially, and in a short time many organisations are already using it in production.

