CSE-6421: Advanced Database Systems
York University
Winter 2010

Saint Lawrence Booksellers Database (StL)

Welcome to the StL database, a small data warehouse (DW) consisting
of synthetic (that is, made-up) data for a small, boutique, web-based
bookseller, Saint Lawrence Booksellers, Inc. The StL DW is reasonably
large for a toy database, but still quite small in comparison to many
real-world databases and data warehouses. The data was synthetically
generated using a model-driven, schema-based generator with different
distributions for different fields. Interesting patterns that can be
data mined are built into the data.

The schema of the StL DW consists of three tables:

    Customer (cust#     BIGINT,
              country   CHAR(3),
              state     CHAR(2),
              city      VARCHAR(15))

    Book (book#         BIGINT,
          language      VARCHAR(10),
          genre         VARCHAR(15),
          publisher     VARCHAR(25),
          price         DECIMAL(5, 2))

    Purchase (cust#     BIGINT,
              book#     BIGINT,
              when      DATE,
              qnty      SMALLINT,
              sale      DECIMAL(5, 2))

The primary keys are cust# for Customer, book# for Book, and the
combination of cust#, book#, and when for Purchase.
In the language of data warehousing, Customer and Book are dimension
tables, and Purchase is the fact table. Customer has 100,000 rows,
Book has 14,831 rows, and Purchase has 6,838,088. There is a foreign
key from Purchase to Customer and from Purchase to Book. In Purchase,
cust#, book#, and when are the dimension columns. The measure column
is sale; it reports the amount of the sale. Each sale is equal to the
book's price * qnty. The column qnty indicates how many copies of the
book were bought in that given purchase. In most cases, qnty = 1.
Note that sale has already accommodated qnty! So, in your queries,
you should not multiply by qnty.

Since StL is a web-based bookseller, we mail the books to the
customers' addresses, so one can think of "customer" as a location.
Call that dimension location, then. That dimension offers a natural
roll-up: country > state > city > cust#. The next dimension is book.
Books have three attributes of interest: language (what language it
is written in), genre (what category it is in, e.g., humor, politics,
etc.), and publisher (what company publishes it). These form three
dimensions on book, and offer another natural cube (treating Book
itself as a small fact table).

The last dimension is time, which is represented by the attribute
when in Purchase. It is of type DATE, so it does not show the time of
day of a purchase, but reports the day of the purchase. Notice there
is no table Time, but we could pretend there is a virtual table for
Time. E.g.,

    Time (year      YEAR,
          month     MONTH,
          day       DAY)

Thus, the time dimension provides for a natural roll-up. Standard SQL
provides many functions for dealing with DATE and TIMESTAMP values.
In fact, we can roll up when in many ways. There are no NULL values
in the data warehouse.
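For instance, DB2's date functions can extract each level of the time
hierarchy from when. A small sketch (following the same unquoted use
of when as in the ByMonth example below):

```sql
-- Extract the time-hierarchy levels from the purchase date.
select year(when)  as year,
       month(when) as month,
       day(when)   as day,
       sale
from   stl.Purchase;
```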

The database is accessible via DB2 on red.cs.yorku.ca or any of the
PRISM machines, under PRISM's default IBM DB2 server (a DB2 v9
instance). The database is named STL, so that is what you need to
connect to.

    % db2 connect to stl

The three tables are in a schema named stl, so to access them, one
needs to prepend "stl." in front of the table name.

    % db2 "select count(*) from stl.purchase"

Or instead, you can set stl as the default schema for each of your
sessions.

    % db2 set schema stl

With that, you need not prepend each of your table references with
"stl.". Either way is fine. See the other instruction pages for DB2
off of the class homepage for more help.


For each query, put the SQL for the query into a file with the same
name as the query (e.g., Toppers). You will submit these files.
Design your queries carefully. Do them in steps, and debug along the
way. Also, once you have (parts of) queries working, reuse them in
subsequent queries. Most correct, well-done queries for this project
take 30 seconds to a couple of minutes to execute. Be patient. For
real DWs, query execution time is often hours. Your query must
complete within five minutes for full credit.


1. Sales
   <year, month, sales>
   The sum of sales (total of sale from Purchase) per month.
   Order by year asc, month asc.
   (answer table)
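One possible shape for this query, as a sketch (it assumes the
stl.Purchase table as described above, and is not necessarily the
intended solution):

```sql
-- Total sales per month, ordered chronologically.
select   year(when)  as year,
         month(when) as month,
         sum(sale)   as sales
from     stl.Purchase
group by year(when), month(when)
order by year asc, month asc;
```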

2. Multiple
   <cust#, book#, genre, language, #copies>
   Customers who have bought the same book multiple times over
   different purchases, where the number of different dates (when) on
   which the person has bought the book is seventy or more. Report in
   #copies the total number of copies of the book (qnty) the person
   bought. Note that a customer might buy several copies of a book
   (qnty) in a purchase; however, if the person does not buy the same
   book again later, this is not counted. In each case, the book, its
   genre and language, and the number of copies bought overall by the
   customer are reported.
   Order by cust# asc, book# asc, genre asc.
   (answer table)
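A sketch of one way to shape this query (assuming the schema above;
not necessarily the intended solution):

```sql
-- Customers who bought a book on seventy or more distinct dates,
-- with the total number of copies bought overall.
with Multi (cust#, book#, #copies) as (
    select   cust#, book#, sum(qnty)
    from     stl.Purchase
    group by cust#, book#
    having   count(distinct when) >= 70
)
select   M.cust#, M.book#, B.genre, B.language, M.#copies
from     Multi M, stl.Book B
where    M.book# = B.book#
order by M.cust# asc, M.book# asc, B.genre asc;
```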

3. Toppers
   <language, city, sales>
   For each language, the city that has the maximum sales (total of
   sale from Purchase). If, for a language, there is a tie for the
   maximum over several cities, all cities in the tie are listed.
   Order by language asc, city asc.
   (answer table)
   Notes:
   Break the query into steps using the with clause. Joining again to
   retrieve the city names is virtually unavoidable, unless one were
   to do this using SQL's OLAP aggregation extensions.
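The step-wise with-clause approach the notes describe might be
sketched as follows (an illustration of the pattern, not necessarily
the intended solution):

```sql
-- Sales per (language, city); then the maximum per language;
-- then a join back to recover the city name(s), keeping ties.
with CitySales (language, city, sales) as (
    select   B.language, C.city, sum(P.sale)
    from     stl.Purchase P, stl.Customer C, stl.Book B
    where    P.cust# = C.cust#
      and    P.book# = B.book#
    group by B.language, C.city
),
MaxSales (language, sales) as (
    select   language, max(sales)
    from     CitySales
    group by language
)
select   S.language, S.city, S.sales
from     CitySales S, MaxSales M
where    S.language = M.language
  and    S.sales = M.sales
order by S.language asc, S.city asc;
```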

4. Sweep
   <cust#, language, genre, publisher>
   Customers who have bought all the books that StL has available
   within a language-genre-publisher category that contains more than
   ten books. For each, list the language-genre-publisher category.
   Order by cust# asc, language asc, genre asc, publisher asc.
   (answer table)
   Notes:
   Writing a query that will execute efficiently is a challenge here.
   Consider using aggregation to count the number of titles (book#)
   there are in each language-genre-publisher category, and to count
   the number of titles each customer has bought in each
   language-genre-publisher category.
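The counting approach suggested in the notes might be sketched like
this (one possible shape; not necessarily the intended solution):

```sql
-- Compare each category's title count against each customer's
-- distinct-title count within that category.
with CatSize (language, genre, publisher, #titles) as (
    select   language, genre, publisher, count(*)
    from     stl.Book
    group by language, genre, publisher
    having   count(*) > 10
),
CustCat (cust#, language, genre, publisher, #bought) as (
    select   P.cust#, B.language, B.genre, B.publisher,
             count(distinct P.book#)
    from     stl.Purchase P, stl.Book B
    where    P.book# = B.book#
    group by P.cust#, B.language, B.genre, B.publisher
)
select   CC.cust#, CC.language, CC.genre, CC.publisher
from     CustCat CC, CatSize CS
where    CC.language  = CS.language
  and    CC.genre     = CS.genre
  and    CC.publisher = CS.publisher
  and    CC.#bought   = CS.#titles
order by CC.cust# asc, CC.language asc, CC.genre asc,
         CC.publisher asc;
```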

5. Same
   <exemplar, #members, #titles, genre>
   A group of customers who have each bought exactly the same books
   as each other within the same genre category, where the number of
   books bought in common is ten or greater. Identify each group by
   the smallest cust# in the group. List the number of members of the
   group (two or more), the number of book titles (#titles) they have
   each commonly bought (ten or more), and the genre category the
   books are from.
   Order by exemplar asc, genre asc.
   (answer table)
   Notes:
   This query is quite a bit harder than the previous ones, but do
   not despair.
   A large issue with this query is how to write a query that will
   execute reasonably efficiently. There are many possible queries
   that are logically correct, but that would take hours, if not
   longer, to complete. To get a handle on how to do this query, one
   needs to think out intermediate steps. The "views" (common table
   expressions) that I had in my with clause when I wrote a query for
   this were as follows.
   - CustTitle:
     A table of cust#, genre, and book# that reports the distinct
     book titles (book#) a customer has bought, by genre. It is not
     necessary to have this, but it makes the rest cleaner to write.
   - CustPurchase:
     Counts the number of titles (distinct book#) as #titles bought
     by each customer per genre. Also has a column hash. (See the
     next hint for this.)
   - CandidatePair:
     Matches up pairs of customers per genre who have bought the same
     number of titles within that genre as one another. (And the hash
     has to match for the two.) This is a candidate pair, but it
     could be a false positive: the two customers still may not have
     bought the very same books, just the same number of books.
   - Differ:
     Of the candidate pairs, selects out the pairs where the first
     customer has bought some book in the genre that the other
     customer has not. (These are the false positives.)
   - Pair:
     Simply the candidate pairs minus the false positives found in
     Differ.
   - Exemplar:
     Selects from the pairs just those for which the first customer
     (who has the smaller cust# of the pair, by design) has the
     smallest cust# (the exemplar) of the clique.
   - main:
     Simply groups by exemplar and counts the number of members.
   Of course, this is just one way to approach the query. You
   certainly do not have to do it this way. For the query to be
   reasonably efficient, one must effectively limit the number of
   candidate pairs that will be considered. There could be lots of
   customers who bought the same number of books within a genre as
   another customer, but only in a few cases would these be the very
   same books. If there are too many candidate pairs to consider, the
   query will be very expensive. One trick would be to use a "hash"
   signature of the set of books (book#) that the customer has
   bought. Then, only if the two customers' hashes are the same is it
   possible that it is the same set of books. The book# column is of
   type BIGINT, so I used the aggregate standard deviation (stddev)
   as my hash.
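As an illustration of the hash trick alone (a sketch of the first few
steps, not the full query; the stddev signature is only a heuristic,
so matching candidates must still be verified as in the Differ step):

```sql
-- Per-customer, per-genre signature: a distinct-title count plus a
-- stddev "hash" over the book#s bought.
with CustTitle (cust#, genre, book#) as (
    select distinct P.cust#, B.genre, P.book#
    from   stl.Purchase P, stl.Book B
    where  P.book# = B.book#
),
CustPurchase (cust#, genre, #titles, hash) as (
    select   cust#, genre, count(book#), stddev(book#)
    from     CustTitle
    group by cust#, genre
)
-- Candidate pairs: same genre, same title count, same hash.
select C1.cust# as cust1, C2.cust# as cust2, C1.genre, C1.#titles
from   CustPurchase C1, CustPurchase C2
where  C1.genre   = C2.genre
  and  C1.cust#   < C2.cust#
  and  C1.#titles = C2.#titles
  and  C1.#titles >= 10
  and  C1.hash    = C2.hash;
```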

ByMonth

The roll-up over YEAR > MONTH > DAY over Purchase, showing the
percentage of sales relative to all sales, and the percentage of
sales within the containing category; e.g., for 1999-11-14, show the
percentage of sales within 1999-11-ALL.

The difficulty this presents is how to find the ratio (percentage)
of sales of a given roll-up instance to the immediate category
containing it (its parent roll-up). Once we compute the roll-up
table, the information about the parent roll-ups we need is in the
same table, so one solution is to join this table to itself to find
the ratios. This presents the difficulty that the join conditions
are tricky: some of the values have been rolled up and are displayed
as "-". These are represented as NULL values, and NULL matches with
nothing. So we must keep track of the level of rolling up that has
occurred (for a given tuple reported) and use this information.
    with
    -- Roll: the roll-up over (year, month); level encodes how many
    -- grouping columns have been rolled up (0 = by month,
    -- 1 = by year, 2 = grand total).
    Roll (year, month, level, sales) as (
        select year(when),
               month(when),
               grouping(year(when)) + grouping(month(when)),
               sum(sale)
        from stl.Purchase
        group by rollup(year(when), month(when))
    ),
    -- Total: the single grand-total row.
    Total (year, month, sales) as (
        select year, month, sales
        from Roll
        where level = 2
    )
    -- Join each row (C) to its parent roll-up (P) one level up, and
    -- to the grand total (T).
    select C.year,
           C.month,
           decimal(float(C.sales) / P.sales, 6, 5) as local_percent,
           decimal(float(C.sales) / T.sales, 6, 5) as total_percent
    from Roll P, Roll C, Total T
    where (P.level = 2 or P.year = C.year)
      and P.level = C.level + 1
    union all
    -- The grand-total row itself is 100% of everything.
    select year, month, 1.0 as local_percent, 1.0 as total_percent
    from Total
    order by year, month;

You may discuss the queries with others. Strategizing with others is
fine. However, the work you turn in must be your own. A "no carry" /
"no cut-and-paste" policy should be observed: you should not carry
away someone else's actual SQL code.

A submit will be set up so that you can electronically submit your
project work. Put your SQL code for each challenge in a file with
that challenge's name. You will submit the five files.

Parke Godfrey