CSE-4411A
Database Management Systems

York University
Fall 2010
Project #2
Merge Join

from Java Minibase
(University of Wisconsin)
  Introduction

For this project, you are to implement the merge-join algorithm within Java Minibase.

For this, you are implementing a high level operator, effectively the merge part of the sort-merge join, on top of existing operators. Much of the infrastructure needed is in place. For instance, you have an external sort routine available to call.

Note that you are implementing the general sort-merge join as discussed in the text and in class, and not the 2-pass sort-merge join. Both files are presented as sorted inputs. Your routine will be merge the two streams, one for the inner "table" and one for the "outer", to accomplish the join.

The tests evaluate real SQL queries evaluated by the Minibase code, using your SortMerge routine! These queries also exercise other operators, but ones implemented already. And the tuples / records are real too. Cool, eh?

The real difficulty of the project is not amount of code. In the end, there is little code to write. Again (like with the buffer-pool project), it is interfacing with Minibase, understanding how to use the different components, and understanding how to work with Minibase's Tuple structure. Of course, this time around, you have some familiarity with Minibase already, so things should be somewhat easier.

 
  Available Documentation

Review the chapter on Implementation of Relational Operations (Chapter 14), in particular, the section on Sort-Merge Join.

The package that contains SortMerge is iterator. You should also read the Javadoc Minibase documentation carefully for the package.

There are many, many classes under the iterator package. You are only to implement

  • SortMerge.java
  • JoinNewFailed.java
  • JoinLowMemory.java

The first is for the SortMerge class. The next two are the only two extra exception classes you should define. You'll be throwing exceptions left and right, but all the rest you shall inherit (and be forced to pass along).

Everything else you need from the iterator package is in place in the JAR file, and you can read about the APIs in the JavaDoc.

Things you will be using from the iterator package will include, for example, the following.

  • class Sort: An external sort routine. Use this to sort inner and/or outer, if needed, before doing the merge.
  • class TupleUtils: This contains useful methods for handling and comparing Tuple's.

You will need to look at other classes too, such as

  • class Tuple in heap: This is the structure for a record / Tuple in Java Mininbase.
  • AttrType in global: These are the supported attribute types.

 
  What You are to Implement

You are to implement the SortMerge class. You need to implement the constructor, and two public methods, get_next() and close(). See the JavaDoc for the API. These are part of Java Minibase's iterator package, and should be declared as such.

Warning: Yes, Java Minibase has an iterator package and implements an Iterator class. So does java.util! Clearly, you cannot import both at the same time. Javac will rightfully complain, if you try. There is no occasion where you will need java.util.Iterator for the project. This is just so that you are aware of the issue.

The SortMerge constructor joins two tables, call them R and S, represented by the Iterator object am1 and am2, respectively, using the merge join algorithm.

Read carefully about the TupleUtils class. You will need to call the setup_op_tuple method for "formatting" and handling the joined records. You will also need to call the compare functions to compare the join columns of records from R and S. You will call the Sort constructor to sort a table (represented by Iterator class).

The main method of SortMerge is get_next(). It returns the next joined tuple. Note that SortMerge acts like a Java iterator. It returns type Tuple. When there are no more tuples from the join output, it returns null.

Complication: There may be duplicate values in the input stream of tuples, both in the inner and the outer streams. Consider a join of Sailors and Reserves on sid. Clearly one sailor may have many reservations.

How do you handle this?

You need to collect the group of all tuples from the inner input stream who match on the join column value (collection A) while your algorithm is producing the join matches with that A record. Then, if the next A record has the same join column(s) value(s), you will use the B group that you just collected to find the matches with this new A record.

Where do you temporarily store this group of B records, all of the same join value? In a heap file on disk?

No. That would be quite expensive I/O-wise if we always did this!

In the buffer pool?

No. (Although that could be made into a reasonable approach. However, we would have change much in Minibase to do this.) Instead, we allocate a temporary tuple-buffer for the B group. (Note that "buffer" here has nothing to do with "buffer pool" necessarily!) Use the functions provided by the existing IoBuf class for this. IoBuf is well behaved. It keeps the group in main memory while it has room. If the group being stored gets too big, it creates a tempory heap-file on disk to put them. So IoBuf handles the sticky issue of where the B group should be temporarily stored for you. This issue happens quite a bit in database-system implementation. The IoBuf class is a useful mechanism in Minibase that provides a solution.

Finally, you need to implement close(), which cleans up. The outside code that has been calling get_next() of the SortMerge object is responsible for calling close() once done. (Look at the tester SM_JoinTest.java to see the sequence of calls.)

 
  Where to Find the Makefiles, Code, etc.

Please copy all the files from /cs/course/4411/mj/, on red.cs.yorku.ca or any of the York PRISM machines, into a local directory of your own, say mj/. You can do this easily, for instance, by saying

% cp -pr /cs/course/4411/mj/src mj

where the second argument, mj, is the directory name of the directory that you are creating. In mj, there are two sub-directories.

  • iterator/
    • The code you write for SortMerge and the Iterator package goes here.
  • tests/
    • Makefile, TestDriver.java, and SM_JoinTest.java: Sort-Merge-Join test driver program. Note that you use this Makefile to run the test suite on your bufmgr code.

Again, you can find the interfaces for all necessary interfaces in the Javadoc Minibase documentation.

 
  What to Turn In, and When

You will use submit to submit your code, together with copies of the output produced by running the tests provided.

Project due date is Wednesday 8 December.

 
  More Information

See the FAQ (Frequently Asked Questions) for more information.

 
Parke Godfrey
With much thanks and credit to the many folks
at the University of Wisconsin
who built Minibase and designed these projects.