Resilient Distributed Datasets (RDD)

The key concepts behind Resilient Distributed Datasets (RDDs) are:

  • The RDD is the core abstraction of Spark programming.
  • An RDD is a fault-tolerant collection of objects spread across a cluster that can be operated on in parallel.
  • If a machine fails, Spark automatically recovers the lost data by recomputing the missing partitions from the recorded chain of transformations (the lineage).
  • We can create an RDD either by distributing the elements of an existing collection or by referencing a dataset stored externally.
  • RDDs support two types of operations: transformations and actions.
  • A transformation creates a new dataset from an existing one. Examples: map, filter, join.
  • An action runs a computation on the dataset and returns a value to the driver program. Examples: reduce, count, collect, save.
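
To make the difference between transformations and actions concrete before installing anything, here is a toy, single-process model in plain Python. The class name ToyRDD and its methods are hypothetical and written only for illustration; this is not Spark's API or implementation. It only mimics two ideas from the list above: transformations are recorded lazily, and results can always be rebuilt by replaying the recorded steps from the source data.

```python
import functools

# Toy model of the RDD idea. ToyRDD is a hypothetical class for illustration
# only; it is NOT part of Spark and runs in a single Python process.
class ToyRDD:
    def __init__(self, source, lineage=()):
        self._source = list(source)      # the original input data
        self._lineage = tuple(lineage)   # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Replay every recorded step starting from the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())


calls = []  # records every element the map function actually processes

def traced_square(x):
    calls.append(x)
    return x * x

numbers = ToyRDD(range(1, 11))
evens_squared = numbers.map(traced_square).filter(lambda x: x % 2 == 0)

# Nothing has run yet: only the lineage was recorded.
calls_before_action = len(calls)

result = evens_squared.collect()                  # action
n = evens_squared.count()                         # action
total = evens_squared.reduce(lambda a, b: a + b)  # action

print(calls_before_action, result, n, total)
```

Each action in this sketch replays the whole lineage, so a lost result can always be rebuilt from the source. Real Spark applies the same principle per partition across a cluster, which this toy does not attempt to model.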

If there is not enough memory available, the data is written to disk, as in MapReduce.

Installing Spark in Google Colab:

Spark is an efficient data processing framework, and we can easily install it in Google Colab.

#Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#Install Spark (change the version number if needed)
!wget -q

#Unzip the Spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

#Set the Java and Spark folders in your environment variables.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

#Install findspark using pip
!pip install -q findspark

#Spark for Python (pyspark)
!pip install pyspark

#importing pyspark
import pyspark

#importing SparkSession
from pyspark.sql import SparkSession

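#creating a SparkSession object that runs locally on all available cores
#("local[*]" is a master URL, so it is passed to master() rather than appName())
spark = SparkSession.builder.master("local[*]").getOrCreate()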

#printing the version of Spark
print("Apache Spark version: ", spark.version)

Google Colab is now ready to run Spark in Python.