Monday, May 17, 2010

Another Note on Fragmented-Replicated-Join

Interesting work in the database community on the subject of FRJ...

A second join that we've had to face with is the computation of disjunctive equi-joins. This seems like an easy problem, though not so simple.

In pig, if I want to join table A and B that is expressed in the following SQL:

select * from A, B where (
          A.ind1=B.ind1 AND
          (A.ind2=B.ind2 OR A.ind3=B.ind3) AND 
          (A.ind4=B.ind4 OR A.ind5=B.ind5));

How would one do this? Pig only supports conjunctive equi-join of the following form:


select * from A, B where (
          A.ind1=B.ind1 AND
          A.ind2=B.ind2 AND
          A.ind3=B.ind3 AND 
          A.ind4=B.ind4 AND
          A.ind5=B.ind5);


C = JOIN A by (ind1, ind2, ind3, ind4, ind5),
         B by (ind1, ind2, ind3, ind4, ind5),

I'm not even looking for an efficient way to do this, just any way to do it in pig or in a map-reduce algorithm would be sufficient for now.

2 comments:

  1. You can simulate disjunctive equi-join with strict equi-joins. To achieve the following SQL:

    select * from A, B where (
    A.ind1=B.ind1 AND
    (A.ind2=B.ind2 OR A.ind3=B.ind3) AND
    (A.ind4=B.ind4 OR A.ind5=B.ind5));


    do instead

    AA = foreach A generate FLATTEN(
    MAKEBAG(
    MAKETUPLE('group1','ind1',ind1),
    MAKETUPLE('group1','ind2',ind2),
    MAKETUPLE('group2','ind3',ind3),
    MAKETUPLE('group2','ind4',ind4)
    )) as (group_type,col_type,col_value),
    )) as (col_type,col_value),
    additional_field
    );

    BB = foreach B generate FLATTEN(
    MAKEBAG(
    MAKETUPLE('group1','ind1',ind1),
    MAKETUPLE('group1','ind2',ind2),
    MAKETUPLE('group2','ind3',ind3),
    MAKETUPLE('group2','ind4',ind4)
    )) as (group_type,col_type,col_value),
    additional_field
    );

    CC = JOIN AA by (ind1,group_type,col_type, col_value),
    BB by (ind1,group_type,col_type, col_value);

    DD = group CC by (ind1, group_type);
    EE = foreach DD {
    AT = DISTINCT AA.col_type;
    BT = DISTINCT BB.col_type;
    ATC = COUNT(AT);
    BTC = COUNT(BT);
    AFIELDS = LIMIT AA.additional_field 1;
    BFIELDS = LIMIT BB.additional_field 1;
    generate ATC, BTC, AFIELDS, BFIELDS;
    }

    FF = filter EE by (ATC>0 AND BTC>0);

    RESULT = foreach FF generate FLATTEN(AFIELDS), FLATTEN(BFIELDS);



    This gives the ability to perform any expression that can be expressed in CNF of equality comparisons.


    I hope this wasn't home work for some undergraduate class you are taking...

    ReplyDelete
  2. For clarity, I've retyped the Pig Latin code to perform the arbitrary conjunctive normal form equi-join using some short hands for tuples and bags:


    AA = foreach A generate FLATTEN(
    {('group1','ind1',ind1),('group1','ind2',ind2),('group2','ind3',ind3),('group2','ind4',ind4)}) as (group_type,col_type,col_value),
    additional_field
    ;

    BB = foreach B generate FLATTEN(
    {('group1','ind1',ind1),('group1','ind2',ind2),('group2','ind3',ind3),('group2','ind4',ind4)}) as (group_type,col_type,col_value),additional_field
    ;

    CC = JOIN
    AA by (ind1,group_type,col_type, col_value),
    BB by (ind1,group_type,col_type, col_value);

    DD = group CC by (ind1, group_type);

    EE = filter DD by (NOT IsEmpty(AA.col_type) AND NOT IsEmpty(BB.col_type);

    RESULT = foreach EE generate FLATTEN(LIMIT AA.additional_fields 1), FLATTEN(LIMIT BB.additional_fields 1);



    YMMV, of course.

    ReplyDelete