Resilient Distributed Datasets (RDD)
Key concepts of Resilient Distributed Datasets (RDDs) are:
- The RDD is the core abstraction of Spark programming.
- An RDD is a fault-tolerant collection of objects spread across a cluster that can be operated on in parallel.
- Spark automatically recovers RDDs from machine failures.
- We can create an RDD either by parallelizing an existing collection or by referencing a dataset stored externally.
- RDDs support two types of operations: transformations and actions.
- A transformation creates a new dataset from an existing one. Examples: map, filter, join.
- An action runs a computation on the dataset and returns a value to the driver program. Examples: reduce, count, collect, save. A short PySpark sketch of both kinds of operations follows this list.
If the available memory is insufficient, data is spilled to disk, much like MapReduce.
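As a minimal sketch of transformations and actions (assuming a SparkSession named spark has already been created, as in the Colab setup below):

# Create an RDD by parallelizing an existing Python collection
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Transformations are lazy: they only describe new RDDs
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
# Actions trigger computation and return values to the driver
print(even_squares.collect())              # [4, 16]
print(squares.count())                     # 5
print(squares.reduce(lambda a, b: a + b))  # 55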
Installation of Spark in Google Colab:
Spark is an efficient data processing framework, and we can easily install it in Google Colab.
#Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Install Spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

#Unzip the Spark file into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

#Set your Spark folder in your system path environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
#Install findspark using pip
!pip install -q findspark
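If you want Python to pick up the tarball installation above (rather than the pip-installed pyspark in the next step), findspark is usually initialized right after installing it; a minimal sketch, which relies on the SPARK_HOME set earlier:

#Point Python at the local Spark installation (uses SPARK_HOME)
import findspark
findspark.init()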
#Install Spark for Python (pyspark)
!pip install pyspark

#Import pyspark
import pyspark

#Import SparkSession
from pyspark.sql import SparkSession
#Create a SparkSession running in local mode; the appName ("ColabSpark" here) can be any label
spark = SparkSession.builder.master("local[*]").appName("ColabSpark").getOrCreate()
#Print the Spark version
print("Apache Spark version: ", spark.version)
Now Google Colab is ready to run Spark in Python.
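As a quick sanity check that the session works (a minimal sketch; the column names and values below are arbitrary):

#Create a small DataFrame and display it
df = spark.createDataFrame([(1, "spark"), (2, "rdd")], ["id", "word"])
df.show()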