What is distributed cache? What are its benefits?

Hadoop’s distributed cache is a service of the MapReduce framework that caches the files an application needs at run time.

Once a file has been cached for a job, Hadoop makes it available on every DataNode where that job's map and reduce tasks run, copying it to the node's local file system so that tasks can read it locally. Your code can then open the cached file and use it to populate an in-memory collection (such as an array or HashMap).
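As a rough illustration of the "populate a collection" step, here is a minimal sketch in plain Java, so it runs without Hadoop on the classpath. The class name CacheLookupSketch, the method loadLookup, and the tab-separated key/value file layout are assumptions made for this example only. In a real job, the file would typically be registered with Job.addCacheFile(URI) in the driver and located through context.getCacheFiles() inside the mapper's setup() method; the loading logic would then look much like this.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class CacheLookupSketch {

    // Reads "key<TAB>value" lines into a HashMap.
    // Assumption: the cached file is tab-separated; lines without a tab are skipped.
    static Map<String, String> loadLookup(Reader source) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the localized cache file; in a mapper's setup() this
        // would be a FileReader opened on the file obtained from the cache.
        Reader cached = new StringReader("US\tUnited States\nIN\tIndia\n");
        Map<String, String> lookup = loadLookup(cached);
        System.out.println(lookup.get("IN")); // prints "India"
    }
}
```

Loading the file once per task, rather than once per record, is the usual reason to do this work in setup(): every call to map() can then look values up in memory.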

The following are some of the advantages of employing distributed cache:

  • It distributes simple read-only text and data files as well as more complex types such as JARs and archives. Archives are unpacked on the worker (slave) node.
  • Distributed cache tracks the modification timestamps of cache files, so the files must not be modified while the job is running.