Terasort consist of three steps, sampling, tagging(map) and sorting(reduce). At a very high level, sampling is performed in JobClient. We sort a subset of input data(e.g. 100,000 keys) and divide them into R partitions. Then, we find the upper bound and the lower bound of every partition(called the reduce boundaries) and store them. For example, if the sampling data is [b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk]
. Then, after you sort it, the resulting data will be [abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr]
If you have 4 reducers, the reduce boundaries will be abd, bcd ,and mnk