Map Side Join

A map side join performs the join before the data reaches the map function. It places strong prerequisites on the input data. Both join techniques (map side and reduce side) come with their own pros and cons. A map side join can be more efficient than a reduce side join, but its strict format requirements are hard to meet natively. If the data has to be prepared by other MR jobs first, the expected performance gain over a reduce side join is lost.
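To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop code) of what a map side join does once its inputs are in the required shape: both inputs have the same number of partitions, each partition is sorted by the join key, and all records for a key sit in the same partition. The function names `merge_join` and `map_side_join` and the sample data are hypothetical and chosen only for illustration. The sketch performs an inner join.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of records sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit every combination of matching records.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    joined.append((lkey, lval, rval))
            i, j = i_end, j_end
    return joined


def map_side_join(left_partitions, right_partitions):
    """Join partition i of one input with partition i of the other, no shuffle needed."""
    if len(left_partitions) != len(right_partitions):
        raise ValueError("both inputs need the same number of partitions")
    result = []
    for left, right in zip(left_partitions, right_partitions):
        result.extend(merge_join(left, right))
    return result


# Hypothetical sample data: two partitions per input, sorted by key,
# with odd keys in partition 0 and even keys in partition 1.
customers = [[(1, "alice"), (3, "carol")], [(2, "bob")]]
orders = [[(1, "book"), (1, "pen")], [(2, "lamp")]]

joined = map_side_join(customers, orders)
for row in joined:
    print(row)
```

Because matching keys are guaranteed to be in the same partition pair, each pair can be joined independently with a single pass over the sorted records, which is why no sort and shuffle phase is needed.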
The input data must be partitioned and sorted in a particular way. Each input must be divided into the same number of partitions. Each input must be sorted by the same key. All the records for a particular key must reside in the same partition.

Reduce Side Join

A reduce side join, also called a repartitioned join or repartitioned sort merge join, is the most commonly used join type. The join is performed on the reduce side, i.e. the data has to go through the sort and shuffle phase, which incurs network overhead. To keep it simple, we will walk through the steps that need to be performed for a reduce side join. A reduce side join uses a few terms, such as data source, tag and group key, so let's get familiar with them first.
Data source refers to the data source files, probably taken from an RDBMS.

Tag is used to mark every record with its source name, so that the source can be identified at any point, whether in the map phase or the reduce phase. Why this is required will be covered later.

Group key refers to the column used as the join key between the two data sources.
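The three terms can be seen working together in the following minimal sketch, written in plain Python rather than against the Hadoop API. It simulates the map, shuffle and reduce phases in memory. The function names (`map_phase`, `shuffle`, `reduce_phase`), the source names and the sample records are all assumptions made for illustration, and the sketch performs an inner join.

```python
from collections import defaultdict


def map_phase(source_name, records, key_index):
    """Emit (group_key, (tag, record)) for every record of one data source."""
    for record in records:
        # The tag is simply the name of the data source the record came from.
        yield record[key_index], (source_name, record)


def shuffle(mapped_pairs):
    """Stand-in for sort and shuffle: bring all records with the same group key together."""
    groups = defaultdict(list)
    for group_key, tagged in mapped_pairs:
        groups[group_key].append(tagged)
    return dict(sorted(groups.items()))


def reduce_phase(group_key, tagged_records, left_tag, right_tag):
    """Use the tags to split one group by source, then emit the joined rows."""
    left = [rec for tag, rec in tagged_records if tag == left_tag]
    right = [rec for tag, rec in tagged_records if tag == right_tag]
    for l_rec in left:
        for r_rec in right:
            yield (group_key, l_rec, r_rec)


# Hypothetical data sources; the customer id is the group key.
customers = [(1, "alice"), (2, "bob"), (3, "carol")]   # id is column 0
orders = [("book", 1), ("lamp", 2), ("pen", 1)]        # customer id is column 1

mapped = list(map_phase("customers", customers, 0))
mapped += list(map_phase("orders", orders, 1))
grouped = shuffle(mapped)

joined = [
    row
    for key, tagged in grouped.items()
    for row in reduce_phase(key, tagged, "customers", "orders")
]
for row in joined:
    print(row)
```

In this sketch the tag matters in the reduce step: once records from both sources arrive mixed together under one group key, the tag is the only way to tell which side of the join each record belongs to.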