Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node only once per job.
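For example, a driver that uses ToolRunner (so that GenericOptionsParser handles the generic -files, -archives, and -libjars options) can distribute a small lookup file like this; the jar, class, and file names below are illustrative, not part of the original text:

    % hadoop jar myjob.jar com.example.MaxTemperatureByStationName \
        -files /local/path/station-names.txt input/ncdc output

The file is copied to HDFS if it is not already there, localized on each task node the first time a task there needs it, and symlinked into the task's working directory so it can be opened by its plain file name.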
You can also use the distributed cache for copying files that do not fit in memory. MapFiles are very useful in this regard, since they serve as an on-disk lookup format (see MapFile on page 137). Because a MapFile is a collection of files with a defined directory structure, you should put it into an archive format (JAR, ZIP, TAR, or gzipped TAR) and add it to the cache using the -archives option.
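As a sketch (paths and the #stations link name are assumptions for illustration), a MapFile directory could be archived and distributed like this:

    % tar czf station-mapfile.tar.gz station-mapfile
    % hadoop jar myjob.jar com.example.LookupDriver \
        -archives station-mapfile.tar.gz#stations input output

Archives passed with -archives are unpacked on the task node, and the fragment after # becomes the name of a symlink to the unpacked directory, so in this sketch a task would open the MapFile under stations/station-mapfile.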
Can you please provide an example of how to use the distributed cache?
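One common pattern, sketched below with illustrative names (station-names.txt, LookupMapper, and the tab-separated file layout are assumptions, not from the original text), is to add a small lookup file with -files and read it once in the mapper's setup() method. Because cached files are symlinked into the task's working directory, the mapper can open the file by its plain name:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: a mapper that loads a small lookup file distributed with the
    // -files option, e.g. launched as:
    //   hadoop jar myjob.jar com.example.LookupDriver \
    //       -files /local/path/station-names.txt input output
    public class LookupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final Map<String, String> lookup = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // The cached file is symlinked into the task's working directory,
        // so it can be opened by its plain file name.
        try (BufferedReader reader =
            new BufferedReader(new FileReader(new File("station-names.txt")))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // Assumed format: stationId <TAB> stationName
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              lookup.put(parts[0], parts[1]);
            }
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Replace the station ID in the input record with its human-readable name.
        String[] fields = value.toString().split("\t");
        String name = lookup.getOrDefault(fields[0], fields[0]);
        context.write(new Text(name), new IntWritable(1));
      }
    }

With the newer MapReduce API, the same effect can be achieved programmatically in the driver with job.addCacheFile(), passing a URI whose #fragment names the symlink to create in the task's working directory.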