Interesting work in the database community on the subject of FRJ...
A second join that we've had to face with is the computation of disjunctive equi-joins. This seems like an easy problem, though not so simple.
In pig, if I want to join table A and B that is expressed in the following SQL:
select * from A, B where (
A.ind1=B.ind1 AND
(A.ind2=B.ind2 OR A.ind3=B.ind3) AND
(A.ind4=B.ind4 OR A.ind5=B.ind5));
How would one do this? Pig only supports conjunctive equi-join of the following form:
select * from A, B where (
A.ind1=B.ind1 AND
A.ind2=B.ind2 AND
A.ind3=B.ind3 AND
A.ind4=B.ind4 AND
A.ind5=B.ind5);
C = JOIN A by (ind1, ind2, ind3, ind4, ind5),
B by (ind1, ind2, ind3, ind4, ind5),
I'm not even looking for an efficient way to do this, just any way to do it in pig or in a map-reduce algorithm would be sufficient for now.
You can simulate disjunctive equi-join with strict equi-joins. To achieve the following SQL:
ReplyDeleteselect * from A, B where (
A.ind1=B.ind1 AND
(A.ind2=B.ind2 OR A.ind3=B.ind3) AND
(A.ind4=B.ind4 OR A.ind5=B.ind5));
do instead
AA = foreach A generate FLATTEN(
MAKEBAG(
MAKETUPLE('group1','ind1',ind1),
MAKETUPLE('group1','ind2',ind2),
MAKETUPLE('group2','ind3',ind3),
MAKETUPLE('group2','ind4',ind4)
)) as (group_type,col_type,col_value),
)) as (col_type,col_value),
additional_field
);
BB = foreach B generate FLATTEN(
MAKEBAG(
MAKETUPLE('group1','ind1',ind1),
MAKETUPLE('group1','ind2',ind2),
MAKETUPLE('group2','ind3',ind3),
MAKETUPLE('group2','ind4',ind4)
)) as (group_type,col_type,col_value),
additional_field
);
CC = JOIN AA by (ind1,group_type,col_type, col_value),
BB by (ind1,group_type,col_type, col_value);
DD = group CC by (ind1, group_type);
EE = foreach DD {
AT = DISTINCT AA.col_type;
BT = DISTINCT BB.col_type;
ATC = COUNT(AT);
BTC = COUNT(BT);
AFIELDS = LIMIT AA.additional_field 1;
BFIELDS = LIMIT BB.additional_field 1;
generate ATC, BTC, AFIELDS, BFIELDS;
}
FF = filter EE by (ATC>0 AND BTC>0);
RESULT = foreach FF generate FLATTEN(AFIELDS), FLATTEN(BFIELDS);
This gives the ability to perform any expression that can be expressed in CNF of equality comparisons.
I hope this wasn't home work for some undergraduate class you are taking...
For clarity, I've retyped the Pig Latin code to perform the arbitrary conjunctive normal form equi-join using some short hands for tuples and bags:
ReplyDeleteAA = foreach A generate FLATTEN(
{('group1','ind1',ind1),('group1','ind2',ind2),('group2','ind3',ind3),('group2','ind4',ind4)}) as (group_type,col_type,col_value),
additional_field
;
BB = foreach B generate FLATTEN(
{('group1','ind1',ind1),('group1','ind2',ind2),('group2','ind3',ind3),('group2','ind4',ind4)}) as (group_type,col_type,col_value),additional_field
;
CC = JOIN
AA by (ind1,group_type,col_type, col_value),
BB by (ind1,group_type,col_type, col_value);
DD = group CC by (ind1, group_type);
EE = filter DD by (NOT IsEmpty(AA.col_type) AND NOT IsEmpty(BB.col_type);
RESULT = foreach EE generate FLATTEN(LIMIT AA.additional_fields 1), FLATTEN(LIMIT BB.additional_fields 1);
YMMV, of course.