HDFS (Hadoop Distributed File System)


The Hadoop Distributed File System was developed on the distributed file system model and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault-tolerant even though it is designed around inexpensive hardware.

HDFS can hold very large amounts of data while still providing easy access to it. To store so much data, files are spread across multiple machines and kept in a redundant fashion, so the system can recover from data loss if a node fails. HDFS also makes applications available for parallel processing.
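As a quick illustration of that access, the usual way to move data in and out of HDFS is the hdfs dfs shell. The following is a minimal sketch that assumes a running HDFS cluster with the Hadoop binaries on the PATH; /data and sample.txt are placeholder names, not part of this setup:

#copy a local file into HDFS, list the directory, and read the file back
#(/data and sample.txt are placeholder names for this sketch)
!hdfs dfs -mkdir -p /data
!hdfs dfs -put sample.txt /data/
!hdfs dfs -ls /data
!hdfs dfs -cat /data/sample.txt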

  • It is responsible for storing data on the cluster, providing both distributed storage and processing.
  • The NameNode and DataNode servers make it easy for users to check the status of the cluster.
  • Each block is replicated multiple times (3 times by default), and the replicas are stored on different nodes.
  • Hadoop Streaming acts as a bridge between your Python code and the Java-based HDFS, letting you access Hadoop clusters and run MapReduce jobs (see the sketch after this list).
  • HDFS provides file permissions and authentication.
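Below is a minimal sketch of what a Hadoop Streaming run looks like, using the install paths from the setup later in this article. It assumes you have written your own mapper.py and reducer.py (hypothetical scripts, not part of this tutorial) and that the streaming jar sits in its usual location under share/hadoop/tools/lib:

#run a Python mapper and reducer through Hadoop Streaming
#mapper.py and reducer.py are hypothetical scripts you would write yourself
!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input ~/input -output ~/streaming_output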


Hadoop Installation in Google Colab

Hadoop is a Java-based data processing framework. Let’s install the Hadoop setup step by step in Google Colab. There are two options: the first is to install Java on our own machine, and the second is to install Java inside Google Colab itself, so nothing needs to be installed locally. Since we are using Google Colab, we take the second route to install Hadoop:

#install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

#create java home variable
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
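Before moving on, it is worth confirming that the JDK is actually available in the Colab runtime; this simply prints the installed Java version:

#verify the java installation
!java -version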

Step 1: Install Hadoop

#download hadoop
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

#we'll use the tar command with the -x flag to extract, -z to uncompress,
#-v for verbose output, and -f to specify that we're extracting from a file
!tar -xzvf hadoop-3.3.0.tar.gz

#copy the hadoop folder to /usr/local
!cp -r hadoop-3.3.0/ /usr/local/
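A quick listing confirms the copy landed where expected:

#check that the hadoop files are now under /usr/local
!ls /usr/local/hadoop-3.3.0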

Step 2: Configure the JAVA_HOME variable

#find the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"
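The path printed above (typically /usr/lib/jvm/java-8-openjdk-amd64 on Colab) also needs to be visible to Hadoop itself. One way to do that, assuming that default path, is to append an export line to hadoop-env.sh:

#point Hadoop at the JDK found above (path assumed to be the Colab default)
!echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh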

Step 3: Run Hadoop

#run Hadoop
!/usr/local/hadoop-3.3.0/bin/hadoop

!mkdir ~/input

!cp /usr/local/hadoop-3.3.0/etc/hadoop/*.xml ~/input

!ls ~/input

!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep ~/input ~/grep_example 'allowed[.]*'
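When the job finishes, its results are written to the output directory given on the command line (~/grep_example above); printing the files in that directory shows the matches that were found:

#inspect the MapReduce output
!cat ~/grep_example/*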

Now, Google Colab is set up to run Hadoop and work with HDFS.