CSE-6490B
Issues in Information Integration

York University
Fall 2008
Syllabus
Instructor: Parke Godfrey
Office: #2050 CSE
Office Hours: Mon 1-3pm
& by appointment / availability
Ph#: 416-736-2100 x66671
e-mail: godfrey@cse.yorku.ca
Term: Fall 2008
Time: Wed 2:30-5:30pm
Place: CB #129
  The Course
The Topic

Database technology has been quite successful at organizing and storing very large amounts of information, and at providing efficient access to this information via powerful query languages. A database management system also helps to protect the integrity of its data.

Unfortunately, the information we might need often is not located at a single source, in a single database, or we do not know in which source the information resides. The answer to a query may need to compose information from various sources. This is called information integration.

Information integration technology and research are becoming increasingly vital. Companies maintain many databases and information sources. They have operating tasks that require various degrees of integration. Governments need integration for everything from security to offering better services. The world wide web can be viewed as a vast collection of information sources from which we would like to extract information collectively.

Course Objectives and Content

In this course, we shall study topics, research, and developments in information integration. It will be conducted much as a readings course. Each class, we shall cover and discuss one key paper and its topic, and additional related papers (recommended reading).

We start from a formal perspective, for two reasons: a formal foundation helps to elucidate the problems and potential solutions; and this is needed to understand much of the research in information integration.

Tentative topics to be covered are as follows.

  1. Introduction
    1. What is the problem?
      • The center cannot hold.
      • Information is distributed over different sources. It is not all in one place under one schema.
    2. What are solutions?
  2. Background
    1. Datalog: A Primer
      1. the data model
      2. evaluation strategies (top-down and bottom-up)
      3. negation, open and closed world assumptions
      4. answer set semantics
    2. SQL-3
      • recursion
  3. Mediation
    1. Query Answering over distributed data sources [semantics]
      1. query rewrite
      2. query containment and query folding
      3. semantic query caching
    2. Schema and Data Integration and Mapping [frameworks]
      1. mediators and wrappers
        • global-as-view vs. local-as-view
      2. schema mapping and integration
        • source-to-target dependencies
        • schematic discrepancies
      3. peer-to-peer approaches
      4. conflict resolution, repairs, and consistent query answers
        • management of uncertain and disjunctive information
        • cooperative answering systems
    3. Query Optimization & Caching for Mediators [implementation]
      1. semantic query optimization
      2. dynamic query processing
  4. Data Models and Semi-structured Data
    1. XML
      1. XML data model
      2. XPath and XQuery
      3. XML integrity constraints
      4. XML query processing
      5. description logics behind XML
    2. storing semi-structured data in relational DBMSs
    3. search versus queries
  5. Ontologies and Semantic Web
    • RDF
    • web ontology
    • OWL
  6. Application Domains
    1. bio-informatics
    2. emergency and rescue
    3. ...

This list is tentative and will no doubt be adjusted during the course. In reality, we probably cannot cover all of these topics. Furthermore, we shall need to adjust dynamically, based on everyone's interests.

Readings

There is no textbook. There will be a primary paper (journal or conference article or book chapter) assigned for each class session. These will be made available on the class webpage. There will often be a second, recommended paper each time. (The discussion leader and responder will be responsible for covering these.)

The other material will be class notes (as lecture slides), tutorial notes, etc., which will be covered in class and made available on the class webpage.

 
  Grading Criteria / Course Requirements
Components

Component                 Percentage       When
Assignments               3 × 10% = 30%    over the term
Classtime Activities      25%              over the term
  discussion leader       10%
  responder               10%
  participation           5%
Report or Project         25%              due at end of classes
Final Exam                20%              takehome, due in exam period

The grading policy is standard. Discussion is fine, but your assignment work must be your own.

Class attendance will not be monitored, but is important since the class will involve readings and discussion. Each person will serve as a discussion leader one time for a given topic. He or she will need to read that day's paper (and "recommended" paper) thoroughly. The leader will present a brief summary of the papers with the important issues, results, and open problems. He or she will then lead the discussion. Each person will also serve as a responder one time for a given topic. The responder also will need to read that day's paper (and "recommended" paper) thoroughly. He or she prepares questions for the leader.

Each student may choose to do either a term report or a term project. A term report will be like a technical report and will summarize a specific topic. (As a ballpark, it would be 12 to 15 pages.) A term project would involve implementing something / building a prototype, or applying existing information integration tools to a specific problem domain. Term projects can be done in teams (up to two people), with the scope of the project commensurate with the size of the team. A team project will receive a single grade. Report / project topics are to be negotiated with the instructor.

There will be a takehome final exam worth 20%, to be done during the exam period. It will have exercises and questions pertinent to the readings and course lectures and materials. It will be open "book", but, of course, without consultation with others.

York University's rules for academic honesty and plagiarism are always in effect. (See below.) Discussion is fine on the assignments and projects. However, collaboration is not. The work must be your own.

 
  Tentative Schedule

Week#  day      topic
#1     3 Sept   Introduction & Background
#2     10 Sept  Logic & Non-monotonicity
#3     17 Sept  rewrite, folding, & containment
#4     24 Sept  mediation, GAV, & LAV
       1 Oct    no class (Rosh Hashanah)
#5     8 Oct    schema mapping & peer-to-peer
#6     15 Oct   query optimization for mediators
#7     22 Oct   query caching for mediators
#8     29 Oct   semi-structured data
#9     5 Nov    XML
#10    12 Nov   XPath & XQuery
#11    19 Nov   search versus query
#12    26 Nov   semantic web & ontologies / applications

 
  Policies
Academic Integrity / Honesty / Plagiarism

The Department of Computer Science (& Engineering) Academic Honesty Guidelines are in effect for this course, as, indeed, they are for any CS&E course.

Plagiarism is defined as taking the language, ideas, or thoughts of another, and representing them as your own. If you use someone else's ideas, cite them. If you use someone else's words, clearly mark them as a quotation. Note that plagiarism includes using another's computer programs or pieces of a program. All noted instances of plagiarism will be reported.

These policies are not intended to keep students from working with other students. One can learn much working with others, so this is to be encouraged. Should you encounter any situation in which you are uncertain whether collaboration is permitted, please ask.