Saturday, May 29, 2010

What does it mean to be superficial

Okay, this does not need to be said out loud for most people... But somehow, I had to type it out on my blog to understand it...

Well, to start, I should say, the phrase "you're so superficial" comes to my mind often, and I mean it in a most derogatory way. (It's kind of Chinese, I suppose.) But now, after some decades of this kind of thinking (and failing), I finally decided to write out what it means to be superficial... When I curse in my head and think of someone as a superficial piece of shit, what is it exactly that I mean? And am I missing something?


Well, it should be easy. Superficiality often has a physical meaning: inside, but near the outside, where inside, outside, and distance are defined. On a person, being superficial often means making gross, short-term observations and deciding or inferring based on them. As a description of a person, though, it is more often a judgement of ethics and morals: to be superficial is to be superficial in a wrong and immoral way.

I went to wikipedia and tried to read up on morality and ethics, and found that if they were formatted into a wikibook, the PDF would run somewhere around 1.1 kilopages.

Okay, let's do something simpler instead. Superficiality is a pattern of actions and thoughts that is superficial in the physical or inferential sense. If such a pattern of actions and thoughts has negative utility, then those actions are considered Superficial and the actor a Superficial Actor.


But this definition actually clears up our problem immediately: being superficial (lowercase, as opposed to the capitalized Superficial defined above) in a way that generates positive utility is the right thing to do. If we do not do the superficial things that have positive utility, then we are being Superficial in the very negative sense; that is, if we fail to generate utility by acting on the simple and obvious (a.k.a. the superficial), then we generate negative utility, which is truly Superficial.
Also, by not taking advantage of the superficial, we become stupid.
Example: the person in front of you is holding a knife. Superficially, I observe a knife and infer that she is dangerous. If I do not act on this superficial observation (seeing an object resembling a knife, for a brief, albeit recent, moment) and run away, then I am stupid.

But a thought does occur to me just now. It's a good thing I didn't read further into ethics, because there appears to be a serious conflict here. Take racial profiling at the airport, say.

Certainly skin color and racial appearance are superficial aspects of each traveller that superficial TSA agents might use. But this surely does not make them Superficial.

Our law of the land requires that government and public corporations operate in ways that reduce superficiality, and I believe this reduces Superficiality as well. But consider the case of the TSA. As much as I hate being inspected when returning from foreign lands, my people sometimes will smuggle seeds (no, not weed seeds, unless you consider bamboo a weed, but the two weeds are illegal for different reasons) and fish... and seriously, sometimes I will sneak seeds in with me for my parents, friends, or relatives. So I, as a traveler, personally dislike the superficiality of racial profiling, but I also appreciate it.

To be honest, if it were the case that

\forall x \in X \; \left( (x \notin \mathrm{plane}) \wedge (x \neq I) \right) \rightarrow \mathrm{safe}

I would accept that X is superficial.

I support the Constitution and believe that doing so is Superficial; but I am sadly Superficial that way, and I believe most voting Americans are too. Am I right?

Basically, what I originally meant to write down is that I need to learn to be superficial without being Superficial, because I am stupid and Superficial in not taking advantage of many superficialities... Writing it out helps me to accept this and to change what I can about it in myself.

Monday, May 17, 2010

Another Note on Fragmented-Replicated-Join

Interesting work in the database community on the subject of FRJ...

A second join problem that we've had to face is the computation of disjunctive equi-joins. This seems like an easy problem, but it is not so simple.

In Pig, suppose I want to join tables A and B as expressed in the following SQL:

select * from A, B where (
          A.ind1=B.ind1 AND
          (A.ind2=B.ind2 OR A.ind3=B.ind3) AND 
          (A.ind4=B.ind4 OR A.ind5=B.ind5));

How would one do this? Pig only supports conjunctive equi-joins of the following form:


select * from A, B where (
          A.ind1=B.ind1 AND
          A.ind2=B.ind2 AND
          A.ind3=B.ind3 AND 
          A.ind4=B.ind4 AND
          A.ind5=B.ind5);


which in Pig is written as:

C = JOIN A by (ind1, ind2, ind3, ind4, ind5),
         B by (ind1, ind2, ind3, ind4, ind5);

I'm not even looking for an efficient way to do this; just any way to do it in Pig or as a map-reduce algorithm would be sufficient for now.
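One workaround does come to mind (a sketch, untested): expand the disjunction into its conjunctive cases and union the results. The predicate (ind2 OR ind3) AND (ind4 OR ind5) expands into four conjunctions, each of which Pig can join on directly:

-- expand the disjunction into its four conjunctive cases,
-- join each one, then union and de-duplicate
J1 = JOIN A by (ind1, ind2, ind4), B by (ind1, ind2, ind4);
J2 = JOIN A by (ind1, ind2, ind5), B by (ind1, ind2, ind5);
J3 = JOIN A by (ind1, ind3, ind4), B by (ind1, ind3, ind4);
J4 = JOIN A by (ind1, ind3, ind5), B by (ind1, ind3, ind5);
U = UNION J1, J2, J3, J4;
C = DISTINCT U;   -- a row pair may satisfy several cases; keep it once

The DISTINCT is needed because a pair of rows can match more than one case (though it also collapses genuine duplicate rows in the input, which may or may not be what you want), and the cost is four separate joins, which is why this is merely "any way" and not an efficient one.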

The Psychology of Farting

As anybody who works in an office knows, other people fart. Often silently but pungently.

A curious question comes to mind. When I smell another person's relief, what is the right thing for me to do?

My initial reaction, and what I actually did recently, was to walk away swiftly, interrupting my own conversation with the other person, who was possibly the perpetrator.

But this makes me an obvious suspect as the passer of gas, because farting usually precedes, follows, or at the very least accompanies the need to have a bowel movement. If I move to leave the room, the obvious assumption is that I just farted and am heading away to prevent additional intrusion into others' noses and to relieve my bowels of their contents.


But if I do not leave, then I must suffer the smell until it dissipates naturally, for if I leave any time before then, or at least before another person gives up and leaves the room, it will appear that I am in more urgent need of relief than the other person, and therefore the original horn blower.


But if I sit there, silently, disguising my disgust, my obviously artificial facial expression will again make me the trumpeter in my pants.

So, I act naturally. As naturally as one can in such a situation.

If I point out who did it, it would seem that I am attempting to divert attention away from myself. I would only do that if I were actually the one broadcasting the expiration date.

If someone else points out that I farted, my attempt to deny it would clearly be just another adjustment for inflation.

But if I falsely accept the accusation that I was the one who developed that WMD, nobody will ever suspect that I was lying!

Sigh, such is the sad state of my mental affairs.

A discussion of the skewed join

I never learned database stuff in CS classes... but I recently had to use Pig's skewed join and thought it was interesting:

The skewed join samples the data (by running a full MR pass over it), splits the data belonging to the largest keys across several reducers, and replicates the smaller table to all of those reducers.
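For reference, the syntax looks something like this (as I recall it; the relation names here are mine, and if I remember correctly it is the left relation that gets sampled and split, with the matching rows of the right relation replicated):

-- left (big) relation is sampled; its skewed keys are split across
-- reducers, and matching rows of the right relation are replicated
C = JOIN big_table by key, small_table by key using "skewed";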

My initial thought when exposed to this was: wow, that's so cool, but so dumb. Why run a full MR pass to sample the data? Why not use HDFS to sample records directly from the input and approximate the split?

Secondly, the optimization of splitting only the large keys into several partitions is unnecessary... Consider this instead: approximately compute the mode hash partition (i.e., hash the keys into the hash space and compute which partition would receive the most keys, possibly exceeding its reducer's capabilities). In reality, this will often be not just identical hash partitions but identical keys.

Conservatively compute the number of reducers that will be needed to handle that hash partition; say this number is k. Then, in the map stage, add to each record's key a random integer between 0 and k-1 (add as in append a separate key component, not arithmetic addition). Cross the smaller table with the integers [0, k-1] and use the integer as part of its key as well. Then join on the reduce side.
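Here is the idea in Pig-flavored pseudocode (purely a sketch: k = 10 is assumed, 'salts' is a hypothetical input holding the integers 0..9, RANDOM() stands in for any random-number UDF, and the field names would need disambiguating in real Pig):

-- tag each big-table row with a random salt in [0, k-1], k = 10
BIGS = FOREACH big GENERATE *, (int)(RANDOM() * 10.0) as salt;
-- replicate each small-table row once per salt value
SALTS = LOAD 'salts' as (salt:int);    -- the integers 0..9, one per row
SMALLS = CROSS small, SALTS;
-- join on (key, salt): each skewed key now spreads over k reducers,
-- and every such reducer already holds the matching small rows
J = JOIN BIGS by (key, salt), SMALLS by (key, salt);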


I argue that this doesn't excessively increase runtime or resource consumption compared to the implementation that splits only the largest keys across several reducers. By obviousness. The small keys will be sent to several reducers, but the smaller table will be waiting for them there, so no biggie.

It reduces the cost by one MR pass, because we can sample, or, if the data is known, we can just let the language specify the number of splits:

r = join x by a, y by b using "skewed 100";


Finally, to push this to an extreme: why bother generating the second key at all? Similar to Pig's map-side join called "replicated", the replicated join can actually happen on the reducer side. Just send the small table to each reducer, randomly throw the big table's records at the reducers (without even putting them into buckets), and perform the join there.
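In the same hand-rolled style, the extreme version might look like this (again only a sketch; r = 100 buckets and a 'buckets' input are assumed, and the join key plays no part in the partitioning, only in the filter at the end):

-- scatter big-table rows across r = 100 buckets uniformly at random
BIGS = FOREACH big GENERATE *, (int)(RANDOM() * 100.0) as bucket;
BUCKETS = LOAD 'buckets' as (bucket:int);   -- the integers 0..99
SMALLS = CROSS small, BUCKETS;              -- whole small table in every bucket
-- each reducer sees its random slice of big rows plus the entire
-- small table; the actual equi-join condition becomes a filter
PAIRED = JOIN BIGS by bucket, SMALLS by bucket;
J = FILTER PAIRED by BIGS::key == SMALLS::key;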

Problems only occur when the "small table" is large, in which case too many different hash keys on one reducer may overwhelm the system. But the system proposed here is no worse than the described system that is implemented right now.

I guess the complication comes in when there are other operations and joins involved. If, for instance, an operation on the join output reduces the size of the data significantly (e.g., a filter), then putting the join on the mapper side is worthwhile. Because, obviously.

But if that is not the case, it seems always more worthwhile to replicate to the reducers than to the mappers, because there is a chance that the small table won't have to be replicated as much (big hash partitions will only receive one row of the small table). And the cost of communicating that extra random integer is almost surely smaller than copying the smaller table to the mappers, joining, and then sending big-table row + small-table row to the reducers. Right?? Send the smaller table once to the reducers and be done with it. Send a small integer along with each big row, but that is probably compressed away, because it will be the same integer in most rows. (Recall that the data is skewed, so most of it sits under the "mode" key, and that key all goes to a few reducers, so the "mode" key and its random number all get compressed down to almost nothing.)

So, I guess the solution is not only to allow us to specify the skewness but also to specify whether

Using "replicated"

is map side or reducer side.


Using "replicated to mapper"

vs

Using "replicated to reducer"



I want to explore the exact situations under which the features proposed above are superior. I would also like to extend the analysis to multi-table situations.

wlog:
Join A by key, B by key, C by key...

with sizes A>B>C....

what kind of key distribution will.....