CSE-6421: Advanced Database Systems
York University
Winter 2010

Saint Lawrence Booksellers Database (StL)

Welcome to the StL database, a small data warehouse (DW) consisting
of synthetic (that is, made-up) data for a small, boutique, web-based
bookseller, Saint Lawrence Booksellers, Inc. The StL DW is reasonably
large for a toy database, but still quite small in comparison to many
real-world databases and data warehouses. The data was synthetically
generated using a model-driven, schema-based generator with different
distributions for different fields. Interesting patterns that can be
data mined are built into the data.

The schema of the StL DW consists of three tables:

    Customer (cust#     BIGINT,
              country   CHAR(3),
              state     CHAR(2),
              city      VARCHAR(15))

    Book (book#         BIGINT,
          language      VARCHAR(10),
          genre         VARCHAR(15),
          publisher     VARCHAR(25),
          price         DECIMAL(5, 2))

    Purchase (cust#     BIGINT,
              book#     BIGINT,
              when      DATE,
              qnty      SMALLINT,
              sale      DECIMAL(5, 2))

The primary keys are cust# for Customer, book# for Book, and the
combination of cust#, book#, and when for Purchase.
In the language of data warehousing, Customer and Book are dimension
tables, and Purchase is the fact table. Customer has 100,000 rows,
Book has 14,831 rows, and Purchase has 6,838,088. There is a foreign
key from Purchase to Customer and from Purchase to Book. In Purchase,
cust#, book#, and when are the dimension columns. The measure column
is sale; it reports the amount of the sale. Each sale is equal to the
book's price * qnty. The column qnty indicates how many copies of the
book were bought in that given purchase. In most cases, qnty = 1.
Note that sale has already accommodated qnty! So, in your queries,
you should not multiply by qnty.

Since StL is a web-based bookseller, we mail the books to the
customers' addresses, so one can think of "customer" as a location.
Call that dimension location, then. That dimension offers a natural
roll-up: country > state > city > cust#. The next dimension is book.
Books have three attributes of interest: language (what language it
is written in), genre (what category it is in, e.g., humor, politics,
etc.), and publisher (what company publishes it). These form three
dimensions on book, and offer another natural cube (treating Book
itself as a small fact table).

The last dimension is time, which is represented by the attribute
when in Purchase. It is of type DATE, so it does not show the time of
day of a purchase, but reports the day of the purchase. Notice there
is no table Time, but we could pretend there is a virtual table for
Time. E.g.,

    Time (year      YEAR,
          month     MONTH,
          day       DAY)

Thus, the time dimension provides for a natural roll-up. Standard SQL
provides many functions for dealing with DATE and TIMESTAMP values.
In fact, we can roll up when in many ways. There are no NULL values
in the data warehouse.
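For instance, DB2's date functions can extract each level of the time
hierarchy from when. A small sketch (following the same unquoted use
of when as in the ByMonth example below):

```sql
-- Extract the time-hierarchy levels from the purchase date.
select year(when)  as year,
       month(when) as month,
       day(when)   as day,
       sale
from   stl.Purchase;
```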

The database is accessible via DB2 on red.cs.yorku.ca or any of the
PRISM machines, under PRISM's default IBM DB2 server (a DB2 v9
instance). The database is named STL, so that is what you need to
connect to.

    % db2 connect to stl

The three tables are in a schema named stl, so to access them, one
needs to prepend "stl." in front of the table name.

    % db2 "select count(*) from stl.purchase"

Or instead, you can set stl as the default schema for each of your
sessions.

    % db2 set schema stl

With that, you need not prepend each of your table references with
"stl.". Either way is fine. See the other instruction pages for DB2
off of the class homepage for more help.


For each query, put the SQL for the query into a file with the same
name as the query (e.g., Toppers). You will submit these files.
Design your queries carefully. Do them in steps, and debug along the
way. Also, once you have (parts of) queries working, reuse them in
subsequent queries. Most correct, well-done queries for this project
take 30 seconds to a couple of minutes to execute. Be patient. For
real DWs, query execution time is often hours. Your query must
complete within five minutes for full credit.


1. Sales
   <year, month, sales>
   The sum of sales (total of sale from Purchase) per month.
   Order by year asc, month asc.
   (answer table)
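One possible shape for this query, as a sketch (it assumes the
stl.Purchase table as described above, and is not necessarily the
intended solution):

```sql
-- Total sales per month, ordered chronologically.
select   year(when)  as year,
         month(when) as month,
         sum(sale)   as sales
from     stl.Purchase
group by year(when), month(when)
order by year asc, month asc;
```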

2. Multiple
   <cust#, book#, genre, language, #copies>
   Customers who have bought the same book multiple times over
   different purchases, where the number of different dates (when) on
   which the person has bought the book is seventy or more. Report in
   #copies the total number of copies of the book (qnty) the person
   bought. Note that a customer might buy several copies of a book
   (qnty) in a purchase; however, if the person does not buy the same
   book again later, this is not counted. In each case, the book, its
   genre and language, and the number of copies bought overall by the
   customer are reported.
   Order by cust# asc, book# asc, genre asc.
   (answer table)
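A sketch of one way to shape this query (assuming the schema above;
not necessarily the intended solution):

```sql
-- Customers who bought a book on seventy or more distinct dates,
-- with the total number of copies bought overall.
with Multi (cust#, book#, #copies) as (
    select   cust#, book#, sum(qnty)
    from     stl.Purchase
    group by cust#, book#
    having   count(distinct when) >= 70
)
select   M.cust#, M.book#, B.genre, B.language, M.#copies
from     Multi M, stl.Book B
where    M.book# = B.book#
order by M.cust# asc, M.book# asc, B.genre asc;
```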

3. Toppers
   <language, city, sales>
   For each language, the city that has the maximum sales (total of
   sale from Purchase). If, for a language, there is a tie for the
   maximum over several cities, all cities in the tie are listed.
   Order by language asc, city asc.
   (answer table)
   Notes:
   Break the query into steps using the with clause. Joining again to
   retrieve the city names is virtually unavoidable, unless one were
   to do this using SQL's OLAP aggregation extensions.
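The step-wise with-clause approach the notes describe might be
sketched as follows (an illustration of the pattern, not necessarily
the intended solution):

```sql
-- Sales per (language, city); then the maximum per language;
-- then a join back to recover the city name(s), keeping ties.
with CitySales (language, city, sales) as (
    select   B.language, C.city, sum(P.sale)
    from     stl.Purchase P, stl.Customer C, stl.Book B
    where    P.cust# = C.cust#
      and    P.book# = B.book#
    group by B.language, C.city
),
MaxSales (language, sales) as (
    select   language, max(sales)
    from     CitySales
    group by language
)
select   S.language, S.city, S.sales
from     CitySales S, MaxSales M
where    S.language = M.language
  and    S.sales = M.sales
order by S.language asc, S.city asc;
```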

4. Sweep
   <cust#, language, genre, publisher>
   Customers who have bought all the books that StL has available
   within a language-genre-publisher category that contains more than
   ten books. For each, list the language-genre-publisher category.
   Order by cust# asc, language asc, genre asc, publisher asc.
   (answer table)
   Notes:
   Writing a query that will execute efficiently is a challenge here.
   Consider using aggregation to count the number of titles (book#)
   there are in each language-genre-publisher category, and to count
   the number of titles each customer has bought in each
   language-genre-publisher category.
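The counting approach suggested in the notes might be sketched like
this (one possible shape; not necessarily the intended solution):

```sql
-- Compare each category's title count against each customer's
-- distinct-title count within that category.
with CatSize (language, genre, publisher, #titles) as (
    select   language, genre, publisher, count(*)
    from     stl.Book
    group by language, genre, publisher
    having   count(*) > 10
),
CustCat (cust#, language, genre, publisher, #bought) as (
    select   P.cust#, B.language, B.genre, B.publisher,
             count(distinct P.book#)
    from     stl.Purchase P, stl.Book B
    where    P.book# = B.book#
    group by P.cust#, B.language, B.genre, B.publisher
)
select   CC.cust#, CC.language, CC.genre, CC.publisher
from     CustCat CC, CatSize CS
where    CC.language  = CS.language
  and    CC.genre     = CS.genre
  and    CC.publisher = CS.publisher
  and    CC.#bought   = CS.#titles
order by CC.cust# asc, CC.language asc, CC.genre asc,
         CC.publisher asc;
```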

5. Same
   <exemplar, #members, #titles, genre>
   A group of customers who have each bought exactly the same books
   as each other within the same genre category, where the number of
   books bought in common is ten or greater. Identify each group by
   the smallest cust# in the group. List the number of members of the
   group (two or more), the number of book titles (#titles) they have
   each commonly bought (ten or more), and the genre category the
   books are from.
   Order by exemplar asc, genre asc.
   (answer table)
   Notes:
   This query is quite a bit harder than the previous ones, but do
   not despair.
   A large issue with this query is how to write a query that will
   execute reasonably efficiently. There are many possible queries
   that are logically correct, but that would take hours, if not
   longer, to complete. To get a handle on how to do this query, one
   needs to think out intermediate steps. The "views" (common table
   expressions) that I had in my with clause when I wrote a query for
   this were as follows.
   - CustTitle:
     A table of cust#, genre, and book# that reports the distinct
     book titles (book#) a customer has bought, by genre. It is not
     necessary to have this, but it makes the rest cleaner to write.
   - CustPurchase:
     Counts the number of titles (distinct book#) as #titles bought
     by each customer per genre. Also has a column hash. (See the
     next hint for this.)
   - CandidatePair:
     Matches up pairs of customers per genre who have bought the same
     number of titles within that genre as one another. (And the hash
     has to match for the two.) This is a candidate pair, but it
     could be a false positive: the two customers still may not have
     bought the very same books, just the same number of books.
   - Differ:
     Of the candidate pairs, selects out the pairs where the first
     customer has bought some book in the genre that the other
     customer has not. (These are the false positives.)
   - Pair:
     Simply the candidate pairs minus the false positives found in
     Differ.
   - Exemplar:
     Selects from the pairs just those for which the first customer
     (who has the smaller cust# of the pair, by design) has the
     smallest cust# (the exemplar) of the clique.
   - main:
     Simply groups by exemplar and counts the number of members.
   Of course, this is just one way to approach the query. You
   certainly do not have to do it this way. For the query to be
   reasonably efficient, one must effectively limit the number of
   candidate pairs that will be considered. There could be lots of
   customers who bought the same number of books within a genre as
   another customer, but only in a few cases would these be the very
   same books. If there are too many candidate pairs to consider, the
   query will be very expensive. One trick would be to use a "hash"
   signature of the set of books (book#) that the customer has
   bought. Then, only if the two customers' hashes are the same is it
   possible that it is the same set of books. The book# column is of
   type BIGINT, so I used the aggregate standard deviation (stddev)
   as my hash.
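As an illustration of the hash trick alone (a sketch of the first few
steps, not the full query; the stddev signature is only a heuristic,
so matching candidates must still be verified as in the Differ step):

```sql
-- Per-customer, per-genre signature: a distinct-title count plus a
-- stddev "hash" over the book#s bought.
with CustTitle (cust#, genre, book#) as (
    select distinct P.cust#, B.genre, P.book#
    from   stl.Purchase P, stl.Book B
    where  P.book# = B.book#
),
CustPurchase (cust#, genre, #titles, hash) as (
    select   cust#, genre, count(book#), stddev(book#)
    from     CustTitle
    group by cust#, genre
)
-- Candidate pairs: same genre, same title count, same hash.
select C1.cust# as cust1, C2.cust# as cust2, C1.genre, C1.#titles
from   CustPurchase C1, CustPurchase C2
where  C1.genre   = C2.genre
  and  C1.cust#   < C2.cust#
  and  C1.#titles = C2.#titles
  and  C1.#titles >= 10
  and  C1.hash    = C2.hash;
```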

ByMonth

The roll-up over YEAR > MONTH > DAY over Purchase, showing the
percentage of sales relative to all sales, and the percentage of
sales within the containing category; e.g., for 1999-11-14, show the
percentage of sales within 1999-11-ALL.

The difficulty this presents is how to find the ratio (percentage)
of sales of a given roll-up instance to the immediate category
containing it (its parent roll-up). Once we compute the roll-up
table, the information about the parent roll-ups we need is in the
same table, so one solution is to join this table to itself to find
the ratios. This presents the difficulty that the join conditions
are tricky: some of the values have been rolled up and are displayed
as "-". These are represented as NULL values, and NULL matches with
nothing. So we must keep track of the level of rolling up that has
occurred (for a given tuple reported) and use this information.
    with
    -- Roll: the roll-up over (year, month); level encodes how many
    -- grouping columns have been rolled up (0 = by month,
    -- 1 = by year, 2 = grand total).
    Roll (year, month, level, sales) as (
        select year(when),
               month(when),
               grouping(year(when)) + grouping(month(when)),
               sum(sale)
        from stl.Purchase
        group by rollup(year(when), month(when))
    ),
    -- Total: the single grand-total row.
    Total (year, month, sales) as (
        select year, month, sales
        from Roll
        where level = 2
    )
    -- Join each row (C) to its parent roll-up (P) one level up, and
    -- to the grand total (T).
    select C.year,
           C.month,
           decimal(float(C.sales) / P.sales, 6, 5) as local_percent,
           decimal(float(C.sales) / T.sales, 6, 5) as total_percent
    from Roll P, Roll C, Total T
    where (P.level = 2 or P.year = C.year)
      and P.level = C.level + 1
    union all
    -- The grand-total row itself is 100% of everything.
    select year, month, 1.0 as local_percent, 1.0 as total_percent
    from Total
    order by year, month;

You may discuss the queries with others. Strategizing with others is
fine. However, the work you turn in must be your own. A "no carry" /
"no cut-and-paste" policy should be observed: you should not carry
away someone else's actual SQL code.

A submit will be set up so that you can electronically submit your
project work. Put your SQL code for each challenge in a file with
that challenge's name. You will submit the five files.

Parke Godfrey