|
EECS-4411M:
Database Management Systems
York University
Winter 2017
|
Project #3:
Merge Join
(from Java Minibase*)
|
|
|
|
For
this project,
you are to implement the merge-join algorithm within Java Minibase.
For this,
you are implementing a high level operator,
effectively the merge part of the sort-merge join,
on top of existing operators.
Much of the infrastructure needed is in place.
For instance,
you have an external sort routine available to call.
Note that you are implementing
the general sort-merge join as discussed in the text and in class,
and not the 2-pass sort-merge join.
Both files are presented as sorted inputs.
Your routine will be merge the two streams,
one for the inner "table" and one for the "outer",
to accomplish the join.
The tests evaluate real SQL queries evaluated by the Minibase code,
using your SortMerge routine!
These queries also exercise other operators,
but ones implemented already.
And the tuples / records are real too.
Cool, eh?
The real difficulty of the project is not amount of code.
In the end,
there is little code to write.
Again (like with the buffer-pool project),
it is interfacing with Minibase,
understanding how to use the different components,
and understanding how to work with Minibase's Tuple
structure.
Of course, this time around,
you have some familiarity with Minibase already,
so things should be somewhat easier.
|
|
Teams
You
may work in teams of size two.
Email me with who composes your team.
(You may work solo,
but you likely would benefit in a team.)
|
|
Documentation
Review
the textbook material and class notes on merge join.
The package that contains SortMerge is iterator.
You should also read the
Javadoc Minibase documentation
carefully for the package.
There are many, many classes under the iterator package.
You are only to implement
- SortMerge.java
- JoinNewFailed.java
- JoinLowMemory.java
The first is for the SortMerge class.
The next two are the only two extra exception classes you should define.
You'll be throwing exceptions left and right,
but all the rest you shall inherit
(and be forced to pass along).
Everything else you need from the iterator package
is in place in the JAR file,
and you can read about the APIs in the JavaDoc.
Things you will be using from the iterator package
will include, for example, the following.
- class Sort:
An external sort routine.
Use this to sort inner and/or outer, if needed,
before doing the merge.
- class TupleUtils:
This contains useful methods
for handling and comparing Tuple's.
You will need to look at other classes too,
such as
- class Tuple in heap:
This is the structure for a record / Tuple in Java Mininbase.
- AttrType in global:
These are the supported attribute types.
|
|
|
|
What You are to Implement
|
|
You
are to implement the SortMerge class.
You need to implement the constructor,
and two public methods, get_next() and close().
See the JavaDoc for the API.
These are part of Java Minibase's iterator package,
and should be declared as such.
Warning:
Yes, Java Minibase has an iterator package
and implements an Iterator class.
So does java.util!
Clearly, you cannot import both at the same time.
Javac will rightfully complain, if you try.
There is no occasion where you will need java.util.Iterator
for the project.
This is just so that you are aware of the issue.
The SortMerge constructor joins two tables,
call them R and S,
represented by the Iterator object am1 and am2,
respectively, using the merge join algorithm.
Read carefully about the TupleUtils class.
You will need to call the setup_op_tuple method
for "formatting" and handling the joined records.
You will also need to call the compare functions to compare
the join columns of records from R and S.
You will call the Sort constructor
to sort a table (represented by Iterator class).
The main method of SortMerge is get_next().
It returns the next joined tuple.
Note that SortMerge acts like a Java iterator.
It returns type Tuple.
When there are no more tuples from the join output,
it returns null.
Complication:
There may be duplicate values in the input stream of tuples,
both in the inner and the outer streams.
Consider a join of Sailors and Reserves on sid.
Clearly one sailor may have many reservations.
How do you handle this?
You need to collect the group of all tuples
from the inner input stream who match
on the join column value (collection A)
while your algorithm is producing the join matches
with that A record.
Then, if the next A record has the same
join column(s) value(s),
you will use the B group that you just collected
to find the matches with this new A record.
Where do you temporarily store this group
of B records, all of the same join value?
In a heap file on disk?
No. That would be quite expensive I/O-wise if we always did this!
In the buffer pool?
No.
(Although that could be made into a reasonable approach.
However, we would have change much in Minibase to do this.)
Instead,
we allocate a temporary tuple-buffer for the B group.
(Note that "buffer" here has nothing to do with "buffer pool" necessarily!)
Use the functions provided by the existing IoBuf class for this.
IoBuf is well behaved.
It keeps the group in main memory while it has room.
If the group being stored gets too big,
it creates a tempory heap-file on disk to put them.
So IoBuf handles the sticky issue of where the B group
should be temporarily stored for you.
This issue happens quite a bit in database-system implementation.
The IoBuf class is a useful mechanism in Minibase
that provides a solution.
Finally,
you need to implement close(),
which cleans up.
The outside code that has been calling get_next()
of the SortMerge object
is responsible for calling close() once done.
(Look at the tester SM_JoinTest.java
to see the sequence of calls.)
|
|
|
|
Where to Find the Makefiles, Code, etc.
|
|
Please
copy all the files from /eecs/course/4411/mj/,
on red.cs.yorku.ca or any of the York PRISM machines,
into a local directory of your own, say mj/.
You can do this easily, for instance, by saying
- % cp -pr /eecs/course/4411/mj/src mj
where the second argument, mj, is the directory name
of the directory that you are creating.
In mj,
there are two sub-directories.
- iterator/
- The code you write for SortMerge and the Iterator
package goes here.
- tests/
- Makefile,
TestDriver.java, and
SM_JoinTest.java:
Sort-Merge-Join test driver program.
Note that you use this Makefile to run the test suite
on your bufmgr code.
Again, you can find the interfaces for all necessary interfaces
in the Javadoc Minibase documentation.
|
|
|
|
What to turn in (and when)
|
|
You
will use submit to submit your code,
together with copies of the
output produced by running the tests provided.
Due by 11:59pm Tuesday 4 April 2017.
Submit as
-
% submit 4411 mj mj
where the second keyword mj/
is the name of the directory in which you put your code
as specified.
(The first keyword “mj” in the submit command
above specifies the project name.)
For documentation,
you do not need to produce javadoc,
and you do not need to comment extensively, line-by-line.
You should comment your code though,
commenting the methods,
commenting the fields,
and placing relevant comments in the code
where things would not be clear otherwise.
Include a small text file named report.txt .
In it,
state the team members
—
just yourself, if you did not work with others
—
and what contributions each person made.
State any problems you encounterd with the project,
and briefly outline to approach of your solution.
I do not expect this to be long;
likely, half a page.
For coding style,
choose a reasonable indentation convention
and be consistent.
Name your variables with understandable names.
And follow general good coding practise.
I will not grade the style component with respect to specific
style guidelines,
but rather with respect to general good coding practise.
A portion of the project grade
will be based on coding style.
Namely,
your code should be readable.
|
|
|
|
See
the FAQ (Frequently Asked Questions)
for the merge-join project
for more information.
Read over the FAQ to start.
There is already quite a bit covered in it.
Not everything in the FAQ will make sense to you in the beginning.
Don't worry about that.
Each of those issues will become clear as your project proceeds.
Refer back to the FAQ as needed.
Of course,
send reasonable questions my way, as needed.
Pertinent questions sent my way regarding the project
will make their way into the FAQ.
So check the FAQ online periodically for updates.
You may discuss the project with others in the class.
The work you turn in has to be your own.
|
|
|
Parke Godfrey
|
|
*With much thanks and credit to the many folks
at the University of Wisconsin
who built Minibase and who designed these projects.
|
|