Introduction to Spark

In this tutorial, you will learn:


1.  How to initialize Spark

2.  How to load a text file into Spark

3.  How to apply an operation to every element of a Spark RDD

4.  How to filter the elements of a Spark RDD

5.  How to group the elements of a Spark RDD by key

6.  How to reduce the elements of a Spark RDD by key

7.  How to sort the elements of a Spark RDD by key

8.  How to reduce the number of partitions of a Spark RDD

9.  How to repartition a Spark RDD

10.  How to save a Spark RDD

11.  How to count the elements of a Spark RDD

12.  How to return the first element of a Spark RDD

13.  How to return the first n elements of a Spark RDD

1. Initialize Spark

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Spark Tutorials")
sc = SparkContext.getOrCreate(conf=conf)

By creating a SparkContext object, we tell Spark how to access the cluster.
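As a quick sanity check that the context is live, you can inspect a few of its properties (the exact values depend on your setup):

print(sc.version)   # Spark version string, e.g. "3.3.0"
print(sc.master)    # e.g. "local[*]" when running locally
print(sc.appName)   # "Spark Tutorials"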

2. Introduction to RDDs

Spark's core data structure is the resilient distributed dataset (RDD).

RDDs are partitioned across the nodes of the cluster, so operations on them can run in parallel.

RDDs are immutable: transformations never modify an RDD in place; they return a new one.
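A minimal sketch of both properties, using sc.parallelize to turn a local list into an RDD:

# Build an RDD from a local list, split across 4 partitions
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(numbers.getNumPartitions())  # 4

# Transformations return a new RDD and leave the original untouched
doubled = numbers.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10, 12, 14, 16]
print(numbers.collect())  # unchanged: [1, 2, 3, 4, 5, 6, 7, 8]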

3. Operations on RDDs

textFile() : Reads a text file and returns its contents as an RDD of lines.

dataset = sc.textFile("inputs.txt")
dataset.take(10)

map(operation) : Returns a new RDD after applying the operation to each element.

map_example=dataset.map(lambda x: (x,1))
map_example.take(20)

filter(operation) : Returns a new RDD containing only the elements for which the operation returns True.

filter_example=dataset.filter(lambda x: len(x)>5)
filter_example.take(20)

groupByKey() : Returns a new RDD where the values for each key are grouped together.

groupByKey_example=map_example.groupByKey()
groupByKey_example.take(10)
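Note that take() on the grouped RDD displays pyspark ResultIterable objects rather than the grouped values themselves; a common pattern is to materialize each group with mapValues(list):

grouped_lists = groupByKey_example.mapValues(list)
grouped_lists.take(10)  # e.g. [('some line', [1]), ('another line', [1, 1]), ...]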

reduceByKey(operation) : Returns a new RDD where the values for each key are reduced using the given operation function.

reduceByKey_example = map_example.reduceByKey(lambda a, b: a + b)
reduceByKey_example.take(20)
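Because map_example pairs each whole line with 1, the reduceByKey call above counts duplicate lines. The classic word count splits lines into words first; a minimal sketch (using flatMap, which works like map but flattens the per-line lists of words):

# Count word occurrences in inputs.txt (assumes whitespace-separated words)
word_counts = (dataset
               .flatMap(lambda line: line.split())  # one element per word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
word_counts.take(10)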

sortByKey() : Returns a new RDD sorted by key.

sortByKey_example = map_example.sortByKey()
sortByKey_example.take(20)
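sortByKey also accepts an ascending flag. To rank the word counts from the sketch above by frequency, swap each pair so the count becomes the key, then sort in descending order:

top_words = (word_counts
             .map(lambda pair: (pair[1], pair[0]))  # (count, word)
             .sortByKey(ascending=False))
top_words.take(10)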

coalesce(number) : Returns a new RDD with the number of partitions reduced to the given number.

coalesce_example=map_example.coalesce(5)
coalesce_example

repartition(number) : Returns a new RDD with exactly the given number of partitions, shuffling the data across the cluster; unlike coalesce, it can increase the partition count.

repartition_example=map_example.repartition(1)
repartition_example
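A quick way to see the effect of both calls (coalesce only merges existing partitions and avoids a full shuffle, so it can only lower the count; repartition always shuffles):

print(map_example.getNumPartitions())          # partition count of the source RDD
print(coalesce_example.getNumPartitions())     # at most 5
print(repartition_example.getNumPartitions())  # 1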

4. Common Functions in Spark

saveAsTextFile() : Saves an RDD as a text file. Note that Spark writes a directory of part files, not a single file.

filter_example.saveAsTextFile("savefile.csv")

count() : Returns the number of elements in an RDD.

count_example=map_example.count()
count_example

first() : Returns the first element in an RDD.

filter_example.first()

take(n) : Returns a list of the first n elements of an RDD.

filter_example.take(5)

Summary


1.  SparkContext : To initialize Spark.

2.  textFile() : To read a text file and return an RDD.

3.  map() : To apply an operation to every element of a Spark RDD.

4.  filter() : To filter the elements of a Spark RDD.

5.  groupByKey() : To group the elements of a Spark RDD by key.

6.  reduceByKey() : To reduce the elements of a Spark RDD by key.

7.  sortByKey() : To sort the elements of a Spark RDD by key.

8.  coalesce() : To reduce the number of partitions of a Spark RDD.

9.  repartition() : To change the number of partitions of a Spark RDD.

10.  saveAsTextFile() : To save an RDD as a text file.

11.  count() : To return the number of elements in a Spark RDD.

12.  first() : To return the first element of a Spark RDD.

13.  take() : To return the first n elements of a Spark RDD.
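Putting the pieces together, here is a minimal end-to-end sketch of the operations above (the paths and word-splitting logic are illustrative assumptions):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Spark Tutorials")
sc = SparkContext.getOrCreate(conf=conf)

# Read, transform, and persist in one pipeline
lines = sc.textFile("inputs.txt")
counts = (lines
          .flatMap(lambda line: line.split())   # one element per word
          .map(lambda word: (word, 1))          # pair each word with 1
          .reduceByKey(lambda a, b: a + b)      # sum the 1s per word
          .sortByKey())                         # order alphabetically by word

print(counts.count())   # number of distinct words
print(counts.first())   # first (word, count) pair in key order
counts.coalesce(1).saveAsTextFile("word_counts_out")  # writes a directory of part files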

You can find the GitHub link here.