CSE-2041A
Net-Centric Computing

York University
Fall 2012
Lab #2: Databases
Intergalactic Data Speak
Posted: 18 September 2012
Revised: 24 September 2012
  JDBC, Derby, & ij

Use the ij utility and your assigned username and password to connect to the database at the URL

jdbc:derby://indigo.cse.yorku.ca:9999/CSE

To connect to the database, once in the ij shell, type

connect 'jdbc:derby://indigo.cse.yorku.ca:9999/CSE;user=cseNNNNN;password=XXXX';

where the NNNNN is the rest of your CSE account, and the XXXX are the last four digits of your York student number. (This Derby account is not the same as your CSE account. I just set it up to use the same names.)

The jdbc is the scheme that identifies the protocol. (This is like HTTP is for Web.) The derby is the sub-scheme. A URL most often does not have one of these, but it may. In this case, it identifies that we are connecting to a Derby database system. Then, indigo.cse.yorku.ca is the host, the server machine where the service is running, and the 9999 indicates the port where it listens. The path, CSE, in this case, indicates the database to which we are connecting (that the Derby database system is managing).

Following the path are the parameters (attribute-value pairs) that are passed to the service. In this case, we are passing just two: a user name and a password for that user. In a URL, the path and the parameters are usually separated by a “?”. It is a syntactic quirk that they need to be separated, in this case, by a semicolon (“;”), instead. The parameters are separated by semicolons, which is usual.

In ij, remember to type the URL of the database to which you are connecting in single quotes (“'”), as above. And end the command with a semicolon. Every command in the ij session needs to be ended with a semicolon. If ij is just sitting there seemingly waiting after you have hit return on a command, expecting a response, check that you typed the semicolon. The nice thing is that your command can span a number of screen lines. The command is not issued until the semicolon appears.

Next, in your ij session, issue the command

set schema roumani;

This sets the schema you use as default to roumani. A database schema is just a namespace in a database, used to group collections of related tables.

The table we are interested in exploring is called sis, and is within schema roumani. Issue

describe sis;

to see the format ( called the schema of the table — sorry, yet another use of the word “schema”!) of the table sis.

You will be issuing SQL queries directly to the database system from your ij session.

When you are finished with the session, to disconnect from the database, do

disconnect;

To leave the session — the ij shell — say

exit;
 
  SQL Primer

See SQL Tutorial & Another SQL Tutorial for good introductions to SQL. SQL is a standardized query language for retrieving information from a relational database (and for updating information in the database, and manipulating databases too).

We will be querying — retrieving information — from a database (CSE), specifically from a table sis, which holds student records.

The basic syntax of an SQL query is

select list-of-columns
from list-of-tables
where list-of-conditions;
		

In the select clause, we specify the columns — the attributes — we want in the answer table. The shorthand of “*” is used here if we want all the columns of the table(s) back.

In the from clause, we specify which tables the information is to draw from. For our queries, it will be just a single table: sis.

The where clause specifies conditions that the rows from the table(s) must satisfy to be returned in the answer table. For instance, GPA > 8.0 would find just students with a GPA above 8. Conditions can be combined with and, or, not, and other logical operators, to make complex conditions. The where clause is optional; if we leave it out, all the rows are returned.

SQL does not have to order the rows in the answer table in any particular way. Usually, we do want them ordered somehow, so we can interpret the results more easily. An order by column-list can be added at the end of the query for this.

For example, say we want to find all computer science majors who have a GPA of more than 8. Let's report the columns ID, surname, givenName, GPA, and city.

select ID, surname, givenName, GPA, city    
from sis
where major = 'Computer Science' and GPA > 8.0
order by surname, givenName, ID;
		

Aggregation is when we want to consolidate results. For instance, we simply might wish to know how many students there are from each city. SQL accommodates this with aggregate functions for return columns, and with two more clauses, group by and having.

select list-of-columns
from list-of-tables
where list-of-conditions
group by list-of-tables
having list-of-conditions.
		

The group by lets us specify which are the grouping columns, for which we want to know aggregate information. In our example of number of students per city, this is city. The having clause is like the where clause. It lets us filter out the aggregated answer rows in the answer table by conditions. Notice this could not be accomplished in the where clause! Like the where clause, the having clause is optional. If we have nothing filter, we can leave it off.

If the group by clause is left off but aggregate column functions are used in the select — so it is an aggredate query — a single answer row is returned, representing the aggregate over the whole input table.

Again, an order by clause can go at the end, to order the answer rows.

So, the number of students per city?

select city, count(*) as students
from sis
group by city
order by city;
		

Notice the as with the name after the aggregate-function column. This is the convention to give a name to the column in the answer table.

This is the tip of the iceberg for SQL. There are many more types of complex queries possible. For instance, we can retrieve information from more than one table. This usually involves joining information from each. But this offers a good basis.

Also, anywhere a table name can appear in a query, we can put another query instead! Since the answer table to a query has the same form as any other table, this works. Such is called a sub-query, and it just goes between “(” & “)”. (The SQL parser can be picky, though. A sub-query in the from usually ought to be named. E.g., (...) as myTable.

Here is another useful construct for the lab. The clause

fetch first k rows only;

can be added as the last clause (after the order by), where k is a positive integer, like 5. This only returns the first k rows of the answer table. This is useful when just playing about, and you do not want huge numbers of rows thrown back at you.

 
  A. SQL: Retrieval

These query exercises retrieve information from the database, and report it in specified ways.

  1. list

    1. Retrieve all the records.

    2. Do the same, but now order by major, surname, givenName.

  2. selection

    Retrieve the record for the student Dante Golden (givenName and surname, respectively).

  3. projection

    List the majors (major).

    Show just major. Do not have duplicate rows!

    Order by major.

  4. range

    1. Retrieve the student records for students with a surname between Go and Gu, inclusive.

      Just show ID, surname, givenName, yearAdmitted.

      Order by surname, givenName, ID.

    2. Do the same query, but find instead the students whose surname's second letter is y.

  5. compound conditions

    1. Retrieve the students majoring in Computer Science, who entered in 2008 or before, and who have a GPA of 6.5 or higher.

      Show surname, givenName, birthdate, GPA.

      Order by surname, givenName, ID.

    2. Do the same query, but find Computer Science majors who entered in 2008 or before, or who have a GPA of 6.5 or higher.
 
  B. SQL: Aggregation

These query exercises aggregate information from the database, to see patterns in the data.

  1. full-set aggregation

    1. How many students are in the table?

    2. Do the same again, but name the return column students.

  2. group by

    List per major the average GPA (avg_gpa).

    Show major, avg_gpa.

    Order by major.

  3. compound group by

    List per major and year admitted the average GPA (avg_gpa).

    Show major, yearAdmitted, avg_gpa.

    Order by major, yearAdmitted.

  4. having

    List the major, the number of students in the major (students), and the average GPA of the major (avg_gpa) for majors with ten or more students.

    Order by major.

  5. sub-query

    Show the student with the highest GPA for each major.

    Show ID, surname, givenName, major, GPA.

    Order by major, ID.

    Might more than one student be reported for a given major? Why?

 
  SQL Examples

SQL is quite expressive. Here are a couple of examples that illustrate a query similar to #10 above and a query similar to the one needed for the lab report.

Find the major(s) per year admitted with the most students still enrolled.

This is hard because it requires an aggregation over an aggregation: 1) we need to know how many (current) students there are per year admitted per major; 2) then, once we know this, we can find, for each year admitted, the largest population over the majors. We could handle these two steps using a sub-query. The sub-query does 1), then the query finds 2).

select yearAdmitted, max(numInMajor) as most
from (  
        select yearAdmitted, major, count(id) as numInMajor
        from sis
        group by yearAdmitted, major
    ) as Y  
group by yearAdmitted;
		

A second difficulty is that we want to report back which major per year admitted has the largest population (numInMajor). Of course, there may not be a single major with the most for a given year admitted, since several majors could tie for most; in cases of ties, we ought to report all the top ones. We cannot modify the query above easily for this. We cannot put major in the select, as it is not part of the group identifier (in the group by). We cannot add it to the group by; this would change the meaning of our query!

To get around this, we have to join our information back with the information of yearAdmitted, major, numInMajor (as in sub-query Y). No problem. We can use additional sub-queries to pull this off.

select M.yearAdmitted, M.major, H.most
from (
        select Y.yearAdmitted, max(Y.numInMajor) as most
        from (
                select S.yearAdmitted, S.major, count(id) as numInMajor
                from sis S
                group by S.yearAdmitted, S.major
            ) as Y
        group by Y.yearAdmitted
    ) as H,
    (
        select S.yearAdmitted, S.major, count(id) as numInMajor
        from sis S
        group by S.yearAdmitted, S.major
    ) as M
where H.yearAdmitted = M.yearAdmitted
    and H.most = M.numInMajor;
		

Note that our from now lists two sources (sub-queries Y and M). Before, we have been looking at just queries that range over a single source (e.g., sis). The query joins information from the two sources. The join conditions are stated in the where clause. In this case, we want to consider pairs of rows from Y and M that are for the same yearAdmitted, and then to keep just the results for those majors that have the most students for that year admitted (that is, where H.most = M.numInMajor).

Find the student(s) with a birthdate, before or equal, to a provided date.

select surname, givenName, birthdate
from sis S,
    (
        select max(B.birthdate) as nearestBDay
        from sis B
        where B.birthdate <= '1990-03-29'
    ) as N
where
    birthdate = nearestBDay
order by surname, givenName;
		

We use aggregation, max, to find the birthday some student has that is before, or equal to, our given date. It is simply the latest birtdate in the table before, or equal to, our date. The sub-query N does this.

We need to join this with table sis to pick up the student's name — or students' names, in the case of a tie.

 
  Lab Report

Send and e-mail to godfrey@acm.org with an SQL query that does the following.

  • Retrieves the student record that would come just before yours, if we were to insert a record for you in the table, as ordered by surname. Show ID, surname, givenName, major.

    If there were two or more students with the same surname that would be just before you, it is fine to list both, or more, records.

    E.g., for Godfrey, the returned results would be

    ID       |SURNAME        |GIVENNAME      |MAJOR                                   
    ----------------------------------------------------------------------------------
    200971224|Glover         |Alexander      |Business                                
    200878456|Glover         |Emma           |Latin American & Caribbean Studies
    			
  • For the e-mail's subject, say

    cse-2041 / lab #2 / cseXXXXX

    where cseXXXXX is your CSE account name.