Resilient Distributed Datasets (RDD)
Key concepts of Resilient Distributed Datasets (RDDs) are:
- The RDD is the core abstraction of Spark programming.
- An RDD is a fault-tolerant collection of objects spread across a cluster that can be operated on in parallel.
- Spark automatically recovers RDDs from machine failures.
- We can create an RDD either by parallelizing an existing collection or by referencing a dataset stored externally.
- RDDs support two types of operations: transformations and actions.
- A transformation creates a new dataset from an existing one. Examples: map, filter, join.
- An action runs a computation on the dataset and returns a value to the driver program. Examples: reduce, count, collect, save. A short PySpark sketch of both kinds of operations follows this list.
If the available memory is insufficient, data is spilled to disk, much like MapReduce.
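As a minimal sketch of transformations and actions (assuming a SparkSession named spark has already been created, as in the Colab setup below):

# Create an RDD by parallelizing an existing Python collection
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Transformations are lazy: they only describe new RDDs
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
# Actions trigger computation and return values to the driver
print(even_squares.collect())              # [4, 16]
print(squares.count())                     # 5
print(squares.reduce(lambda a, b: a + b))  # 55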
Installation of Spark in Google Colab:
Spark is an efficient data processing framework, and we can easily install it in Google Colab.
#Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Install Spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

#Unzip the Spark file into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

#Set your Spark folder in your system path environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
#Install findspark using pip
!pip install -q findspark
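If you want Python to pick up the tarball installation above (rather than the pip-installed pyspark in the next step), findspark is usually initialized right after installing it; a minimal sketch, which relies on the SPARK_HOME set earlier:

#Point Python at the local Spark installation (uses SPARK_HOME)
import findspark
findspark.init()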
#Install Spark for Python (pyspark)
!pip install pyspark

#Import pyspark
import pyspark

#Import SparkSession
from pyspark.sql import SparkSession
#Create a SparkSession running in local mode; the appName ("ColabSpark" here) can be any label
spark = SparkSession.builder.master("local[*]").appName("ColabSpark").getOrCreate()
#Print the Spark version
print("Apache Spark version: ", spark.version)
Now Google Colab is ready to run Spark in Python.
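As a quick sanity check that the session works (a minimal sketch; the column names and values below are arbitrary):

#Create a small DataFrame and display it
df = spark.createDataFrame([(1, "spark"), (2, "rdd")], ["id", "word"])
df.show()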