EECS-4411M: Database Management Systems
York University
Winter 2017
Project #3: Merge Join
(from Java Minibase*)
  Introduction

For this project, you are to implement the merge-join algorithm within Java Minibase.

For this, you are implementing a high level operator, effectively the merge part of the sort-merge join, on top of existing operators. Much of the infrastructure needed is in place. For instance, you have an external sort routine available to call.

Note that you are implementing the general sort-merge join as discussed in the text and in class, and not the 2-pass sort-merge join. Both files are presented as sorted inputs. Your routine will be merge the two streams, one for the inner "table" and one for the "outer", to accomplish the join.

The tests evaluate real SQL queries evaluated by the Minibase code, using your SortMerge routine! These queries also exercise other operators, but ones implemented already. And the tuples / records are real too. Cool, eh?

The real difficulty of the project is not amount of code. In the end, there is little code to write. Again (like with the buffer-pool project), it is interfacing with Minibase, understanding how to use the different components, and understanding how to work with Minibase's Tuple structure. Of course, this time around, you have some familiarity with Minibase already, so things should be somewhat easier.

Teams

You may work in teams of size two. Email me with who composes your team. (You may work solo, but you likely would benefit in a team.)

Documentation

Review the textbook material and class notes on merge join.

The package that contains SortMerge is iterator. You should also read the Javadoc Minibase documentation carefully for the package.

There are many, many classes under the iterator package. You are only to implement

  • SortMerge.java
  • JoinNewFailed.java
  • JoinLowMemory.java

The first is for the SortMerge class. The next two are the only two extra exception classes you should define. You'll be throwing exceptions left and right, but all the rest you shall inherit (and be forced to pass along).

Everything else you need from the iterator package is in place in the JAR file, and you can read about the APIs in the JavaDoc.

Things you will be using from the iterator package will include, for example, the following.

  • class Sort: An external sort routine. Use this to sort inner and/or outer, if needed, before doing the merge.
  • class TupleUtils: This contains useful methods for handling and comparing Tuple's.

You will need to look at other classes too, such as

  • class Tuple in heap: This is the structure for a record / Tuple in Java Mininbase.
  • AttrType in global: These are the supported attribute types.

 
  What You are to Implement

You are to implement the SortMerge class. You need to implement the constructor, and two public methods, get_next() and close(). See the JavaDoc for the API. These are part of Java Minibase's iterator package, and should be declared as such.

Warning: Yes, Java Minibase has an iterator package and implements an Iterator class. So does java.util! Clearly, you cannot import both at the same time. Javac will rightfully complain, if you try. There is no occasion where you will need java.util.Iterator for the project. This is just so that you are aware of the issue.

The SortMerge constructor joins two tables, call them R and S, represented by the Iterator object am1 and am2, respectively, using the merge join algorithm.

Read carefully about the TupleUtils class. You will need to call the setup_op_tuple method for "formatting" and handling the joined records. You will also need to call the compare functions to compare the join columns of records from R and S. You will call the Sort constructor to sort a table (represented by Iterator class).

The main method of SortMerge is get_next(). It returns the next joined tuple. Note that SortMerge acts like a Java iterator. It returns type Tuple. When there are no more tuples from the join output, it returns null.

Complication: There may be duplicate values in the input stream of tuples, both in the inner and the outer streams. Consider a join of Sailors and Reserves on sid. Clearly one sailor may have many reservations.

How do you handle this?

You need to collect the group of all tuples from the inner input stream who match on the join column value (collection A) while your algorithm is producing the join matches with that A record. Then, if the next A record has the same join column(s) value(s), you will use the B group that you just collected to find the matches with this new A record.

Where do you temporarily store this group of B records, all of the same join value? In a heap file on disk?

No. That would be quite expensive I/O-wise if we always did this!

In the buffer pool?

No. (Although that could be made into a reasonable approach. However, we would have change much in Minibase to do this.) Instead, we allocate a temporary tuple-buffer for the B group. (Note that "buffer" here has nothing to do with "buffer pool" necessarily!) Use the functions provided by the existing IoBuf class for this. IoBuf is well behaved. It keeps the group in main memory while it has room. If the group being stored gets too big, it creates a tempory heap-file on disk to put them. So IoBuf handles the sticky issue of where the B group should be temporarily stored for you. This issue happens quite a bit in database-system implementation. The IoBuf class is a useful mechanism in Minibase that provides a solution.

Finally, you need to implement close(), which cleans up. The outside code that has been calling get_next() of the SortMerge object is responsible for calling close() once done. (Look at the tester SM_JoinTest.java to see the sequence of calls.)

 
  Where to Find the Makefiles, Code, etc.

Please copy all the files from /eecs/course/4411/mj/, on red.cs.yorku.ca or any of the York PRISM machines, into a local directory of your own, say mj/. You can do this easily, for instance, by saying

% cp -pr /eecs/course/4411/mj/src mj

where the second argument, mj, is the directory name of the directory that you are creating. In mj, there are two sub-directories.

  • iterator/
    • The code you write for SortMerge and the Iterator package goes here.
  • tests/
    • Makefile, TestDriver.java, and SM_JoinTest.java: Sort-Merge-Join test driver program. Note that you use this Makefile to run the test suite on your bufmgr code.

Again, you can find the interfaces for all necessary interfaces in the Javadoc Minibase documentation.

 
  What to turn in (and when)

You will use submit to submit your code, together with copies of the output produced by running the tests provided.

Due by 11:59pm Tuesday 4 April 2017.

Submit as

% submit 4411 mj mj

where the second keyword mj/ is the name of the directory in which you put your code as specified. (The first keyword “mj” in the submit command above specifies the project name.)

For documentation, you do not need to produce javadoc, and you do not need to comment extensively, line-by-line. You should comment your code though, commenting the methods, commenting the fields, and placing relevant comments in the code where things would not be clear otherwise.

Include a small text file named report.txt. In it, state the team members — just yourself, if you did not work with others — and what contributions each person made. State any problems you encounterd with the project, and briefly outline to approach of your solution. I do not expect this to be long; likely, half a page.

For coding style, choose a reasonable indentation convention and be consistent. Name your variables with understandable names. And follow general good coding practise. I will not grade the style component with respect to specific style guidelines, but rather with respect to general good coding practise.

A portion of the project grade will be based on coding style. Namely, your code should be readable.

 
  More Information

See the FAQ (Frequently Asked Questions) for the merge-join project for more information.

Read over the FAQ to start. There is already quite a bit covered in it. Not everything in the FAQ will make sense to you in the beginning. Don't worry about that. Each of those issues will become clear as your project proceeds. Refer back to the FAQ as needed.

Of course, send reasonable questions my way, as needed. Pertinent questions sent my way regarding the project will make their way into the FAQ. So check the FAQ online periodically for updates.

You may discuss the project with others in the class. The work you turn in has to be your own.

Parke Godfrey
*With much thanks and credit to the many folks
at the University of Wisconsin
who built Minibase and who designed these projects.