Introduction to PySpark | Distributed Computing with Apache Spark

Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms involving large datasets and heavy computation are often run on distributed computing systems. A distributed computing system consists of nodes (networked computers) that run processes in parallel and communicate with each other when necessary.

MapReduce – MapReduce is the programming model used for distributed computing. It involves two stages, Map and Reduce.

  1. Map – The mapper processes each line of the input data (typically a file) and produces key-value pairs.
    Input data → Mapper → list([key, value])
  2. Reduce – The reducer processes the list of key-value pairs produced by the mapper and outputs a new set of key-value pairs.
    list([key, value]) → Reducer → list([key, list(values)])
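As a concrete illustration, the classic word-count example can be sketched in plain Python, with the map, shuffle, and reduce stages written out explicitly (a toy, single-machine simulation of the model, not a distributed implementation; the input lines are made up for the example):

```python
from collections import defaultdict

# Input data: one "line" per element, as a mapper would receive it
lines = ["spark is fast", "spark is distributed"]

# Map stage: each line -> list of (key, value) pairs
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle stage: group values by key -> (key, list(values))
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce stage: aggregate each key's list of values
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```

In a real MapReduce system, the map and reduce stages run on many nodes in parallel, and the shuffle stage moves each key's values to a single node.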

Spark – Spark (an open-source Big Data processing engine by Apache) is a cluster computing system. It is faster than comparable cluster computing systems (such as Hadoop MapReduce) and provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because Python makes the learning curve flatter. To install Spark on a Linux system, follow this. To run Spark on a multi-node cluster, follow this. We will see how to create RDDs (the fundamental data structure of Spark).

RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. This will become clearer below.

SparkContext – To create a standalone application in Spark, we first define a SparkContext –




from pyspark import SparkConf, SparkContext

# setMaster("local") - run tasks on a single machine
conf = SparkConf().setMaster("local").setAppName("Test")
sc = SparkContext(conf=conf)

RDD transformations – Now that a SparkContext object has been created, we can create RDDs and apply some transformations to them.


# create an RDD called lines from 'file_name.txt'
# (the second argument is the minimum number of partitions)
lines = sc.textFile("file_name.txt", 2)

# collect() gathers the whole RDD so it can be printed
print(lines.collect())


One major advantage of using Spark is that it does not load the dataset into memory immediately; lines is a pointer to the 'file_name.txt' file, and the file is only read when an action (such as collect()) is called.
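This lazy evaluation is similar in spirit to Python generators: a transformation only describes work, and nothing is computed until an action asks for results. A rough pure-Python analogy (for illustration only; no Spark involved):

```python
# A generator expression, like an RDD transformation, describes work lazily.
numbers = range(1, 6)
squares = (n * n for n in numbers)   # nothing computed yet (like rdd.map)

# Only when we ask for the results is the work actually performed
result = list(squares)               # like rdd.collect()
print(result)                        # [1, 4, 9, 16, 25]
```

In Spark the same idea lets the engine plan the whole chain of transformations before touching the (possibly huge) dataset.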

A simple PySpark app to count the degree of each vertex for a given graph


from pyspark import SparkConf, SparkContext

# setMaster("local") - run tasks on a single machine
conf = SparkConf().setMaster("local").setAppName("Test")
sc = SparkContext(conf=conf)

def conv(line):
    # "1 2" -> (1, [2]): key is the source vertex, value is a
    # one-element list holding the destination vertex
    line = line.split()
    return (int(line[0]), [int(line[1])])

def numNeighbours(x, y):
    # merge two neighbour lists belonging to the same vertex
    return x + y

lines = sc.textFile('graph.txt')
edges = lines.map(conv)
Adj_list = edges.reduceByKey(numNeighbours)

# degree of a vertex = length of its neighbour list
print(Adj_list.mapValues(len).collect())


Understanding the above code

  1. Our text file is in the following format – (each line represents an edge of a directed graph)
    1    2
    1    3
    2    3
    3    4
    .    .
    .    .
    .    .
  2. Large datasets may contain millions of nodes and edges.
  3. First few lines set up the SparkContext. We create an RDD lines from it.
  4. Then, we transform the lines RDD into the edges RDD. The function conv acts on each line, and key-value pairs of the form (1, [2]), (1, [3]), (2, [3]), (3, [4]), … are stored in the edges RDD (each value is a one-element neighbour list, so that lists can be merged in the reduce stage).
  5. After this, reduceByKey aggregates all the pairs corresponding to a particular key, using the numNeighbours function to merge their neighbour lists. This yields the adjacency list Adj_list, of the form (1, [2, 3]), (2, [3]), (3, [4]), …. Taking the length of each neighbour list then gives each vertex's degree: (1, 2), (2, 1), (3, 1), ….
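To see what the pipeline computes without a Spark installation, the same map/reduceByKey logic can be traced in plain Python (a toy simulation for illustration only; the names mirror the PySpark app above, and the sample edges are the ones from the text file):

```python
# Simulate the PySpark degree-counting pipeline on the sample edge list.
edges_text = ["1 2", "1 3", "2 3", "3 4"]

def conv(line):
    # same mapper as the PySpark app: "1 2" -> (1, [2])
    a, b = line.split()
    return (int(a), [int(b)])

# map stage: one (vertex, [neighbour]) pair per edge
pairs = [conv(line) for line in edges_text]

# reduceByKey stage: merge neighbour lists per key
adj = {}
for key, neighbours in pairs:
    adj[key] = adj.get(key, []) + neighbours

# degree of a vertex = length of its neighbour list
degrees = {v: len(ns) for v, ns in adj.items()}
print(sorted(degrees.items()))   # [(1, 2), (2, 1), (3, 1)]
```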

Running the code

  1. The above code can be run with the following commands –

    $ cd /home/arik/Downloads/spark-1.6.0/
    $ ./bin/spark-submit degree.py
    
  2. Substitute your own Spark installation path in the first line.

We will see more on how to run MapReduce tasks on a cluster of machines using Spark, and also go through other MapReduce tasks.


This article is contributed by Arik Pamnani.



