Monday, May 17, 2010

A discussion of the skewed join

I never learned database stuff in CS classes... but recently I had to use Pig's skewed join and thought it was interesting:

The skewed join samples the data (by running a full MR pass over it), then splits the data belonging to the largest keys across several reducers while replicating the smaller table to all of those reducers.

My initial reaction was: wow, that's so cool, but also so dumb. Why run a full MR just to sample the data? Why not use HDFS to sample records directly from the input and approximate the split?

Secondly, the optimization of splitting only the large keys into several partitions is unnecessary... Consider this instead: approximately compute the mode hash partition (i.e. hash each key into the hash space and find the partition that would receive the most records, possibly exceeding that reducer's capacity). And in reality, this will often be not just an identical hash partition but an identical key.
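A minimal sketch of that estimate, assuming we can pull a small random sample of the big table's keys straight off HDFS; NUM_PARTITIONS, SAMPLE_FRACTION, and the sample itself are made-up values for illustration:

# Estimate the "mode" hash partition from a sample instead of a full MR pass.
from collections import Counter

NUM_PARTITIONS = 8      # size of the hash space = number of reducers (assumed)
SAMPLE_FRACTION = 0.01  # fraction of the big table we sampled (assumed)
sampled_keys = ['a', 'a', 'a', 'a', 'b', 'c', 'a', 'd']  # stand-in for sampled records

# Count how many sampled records fall into each hash partition.
partition_counts = Counter(hash(k) % NUM_PARTITIONS for k in sampled_keys)
mode_partition, mode_count = partition_counts.most_common(1)[0]

# Scale back up to estimate how many records the heaviest partition will really get.
estimated_full_count = mode_count / SAMPLE_FRACTION
print(mode_partition, estimated_full_count)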

Conservatively compute the number of reducers that will be needed to handle that hash partition; say this number is k. Then in the map stage append a random integer between 0 and k-1 to each record's key (append as in add a separate key component, not arithmetic addition). Cross the smaller table with the integers [0, k-1] and use the integer as part of its key as well. Join on the reducer side.
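Here is a minimal simulation of the whole salted join in plain Python (not real MapReduce code); K plays the role of k above, and the toy tables and helper names are assumptions for illustration:

import random
from collections import defaultdict

K = 4  # assumed: reducers conservatively allotted to the heavy partition

def map_big(record):
    # Big table: tag each record with one random salt in [0, K).
    key, value = record
    yield ((key, random.randrange(K)), ('big', value))

def map_small(record):
    # Small table: replicate each record across every salt so it meets all salted copies.
    key, value = record
    for salt in range(K):
        yield ((key, salt), ('small', value))

def shuffle(pairs):
    # Group by (key, salt); stands in for the MapReduce shuffle.
    groups = defaultdict(list)
    for composite_key, tagged_value in pairs:
        groups[composite_key].append(tagged_value)
    return groups

def reduce_join(group):
    # Emit the cross product of big and small rows sharing a (key, salt).
    bigs = [v for tag, v in group if tag == 'big']
    smalls = [v for tag, v in group if tag == 'small']
    for b in bigs:
        for s in smalls:
            yield (b, s)

big = [('skewed_key', i) for i in range(10)] + [('rare_key', 99)]  # toy big table
small = [('skewed_key', 'dim1'), ('rare_key', 'dim2')]             # toy small table

pairs = [p for r in big for p in map_big(r)] + [p for r in small for p in map_small(r)]
for (key, salt), group in shuffle(pairs).items():
    for joined in reduce_join(group):
        print(key, salt, joined)

The heavy key's rows end up spread over up to K reducers, each of which already holds the matching small-table row.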


I argue that this doesn't excessively increase runtime or resource consumption compared to the implementation that splits only the largest keys across several reducers. It's fairly obvious: the small keys will also be spread over several reducers, but the smaller table will be waiting for them there, so no biggie.

It reduces cost by one MR, because we can sample instead, or, if the data distribution is already known, we can just let the language specify the number of splits:

r = join x by a, y by b using "skewed 100";


Finally, to push this to an extreme: why bother generating the second key at all? Similar to Pig's map-side join called "replicated", the replicated join can actually happen on the reducer side. Just send the small table to every reducer, throw the big table's records at the reducers at random (without even bucketing them by key), and perform the join there.
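A minimal simulation of that reduce-side replicated join, again in plain Python; NUM_REDUCERS and the toy tables are assumed for illustration:

import random
from collections import defaultdict

NUM_REDUCERS = 4  # assumed number of reducers

def partition_big(big_rows, num_reducers):
    # Scatter big-table rows across reducers uniformly at random, ignoring the join key.
    parts = defaultdict(list)
    for row in big_rows:
        parts[random.randrange(num_reducers)].append(row)
    return parts

def reduce_side_replicated_join(big_part, small_rows):
    # Each reducer holds the entire small table in a hash map and probes it with its big rows.
    small_index = defaultdict(list)
    for key, value in small_rows:
        small_index[key].append(value)
    for key, value in big_part:
        for s in small_index.get(key, []):
            yield (key, value, s)

big = [('skewed_key', i) for i in range(10)] + [('rare_key', 99)]  # toy big table
small = [('skewed_key', 'dim1'), ('rare_key', 'dim2')]             # toy small table

for reducer_id, part in partition_big(big, NUM_REDUCERS).items():
    for joined in reduce_side_replicated_join(part, small):
        print(reducer_id, joined)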

Problems only occur when the "small table" is itself large, in which case holding too many distinct keys on one reducer may overwhelm the system. But the scheme proposed here is no worse than the one that is implemented right now.

I guess the complication comes in when there are other operations and joins involved. If, for instance, an operation applied at the join reduces the size of the data significantly (e.g. a filter), then putting the join on the mapper side is worthwhile, since far less data then has to be shuffled to the reducers.

But if that is not the case, it seems always more worthwhile to replicate to the reducers than to the mappers. There is a chance that the small table won't have to be replicated as much (a big hash partition will only receive one row of the small table). AND, the cost of communicating that extra random integer is almost surely smaller than copying the small table to the mappers, joining there, and then sending big-table row + small-table row on to the reducers. Right?? Send the small table once to the reducers and be done with it. We do send a small integer along with each big-table row, but that is probably compressed away, because it will be the same integer in most rows. (Recall that the data is skewed, so most records carry the "mode" key, and that key all goes to a few reducers, so the "mode" key and its random number get compressed to nearly nothing.)

So, I guess the solution is not only to let us specify the skewness, but also to specify whether

Using "replicated"

is map side or reducer side.


Using "replicated to mapper"

vs

Using "replicated to reducer"



I want to explore the exact situations under which the features proposed above are superior. I'd also like to extend the analysis to multi-table situations.

WLOG:
Join A by key, B by key, C by key...

with sizes A > B > C...

what kind of key distribution will.....
