CSE-6490B
Issues in Information Integration

York University
Fall 2013
Syllabus
Instructor: Parke Godfrey
Office: #2050 CSE
Office Hours: We 5-7pm
& by appointment / availability
Ph#: 416-736-2100 x66671
e-mail: godfrey@cse.yorku.ca
Term: Fall 2013
Time: Tu & Th 5:30-7:00pm
Place: Bethune College #228
  The Course
The Topic: Information Integration

Database technology has been quite successful for organizing and storing very large amounts of information, and providing efficient access to this information via powerful query languages. A database management system also helps to protect the integrity of its data.

Unfortunately, the information we need is often not located at a single source, in a single database, or we do not know in which source the information resides. The answer to a query may need to combine information from various sources. This is called information integration.

Information integration technology and research are becoming increasingly vital. Companies maintain many databases and information sources, and they have operating tasks that require various degrees of integration. Governments need integration for everything from security to offering better services. The World Wide Web can be viewed as a vast collection of information sources from which we would like to extract information collectively.

Course Objectives & Content

In this course, we shall study topics, research, and developments in information integration. It will be conducted much as a readings course. Each class, we shall cover and discuss one key paper and its topic, and additional related papers (recommended reading).

We start from a formal perspective, for two reasons: a formal foundation helps to elucidate the problems and potential solutions; and this is needed to understand much of the research in information integration.

The tentative course outline is as follows.

  1. Introduction
    • What is information?
    • What are the issues with integration?
      • The center cannot hold.
      • Information is distributed over different sources. It is not all in one place under one schema.
    • What are solutions?
  2. Semantics
    1. logic
      1. Datalog: A Primer
        1. the data model
        2. evaluation strategies (top-down and bottom-up)
        3. negation, open and closed world assumptions
      2. answer set semantics
    2. Mediation: Integrating Sources
      1. Query Answering over distributed data sources [semantics]
        1. query rewrite
        2. query containment and query folding
        3. semantic query caching
      2. Schema and Data Integration and Mapping [frameworks]
        1. mediators and wrappers
          • global-as-view vs. local-as-view
        2. schema mapping and integration
          • source-to-target dependencies
          • schematic discrepancies
        3. peer to peer approaches
        4. conflict resolution, repairs, and consistent query answers
          • management of uncertain and disjunctive information
          • cooperative answering systems
  3. Data Models & Query Languages
    1. Semi-structured data & XML
      1. XML data model
      2. XPath and XQuery
      3. XML integrity constraints
      4. XML query processing
      5. description logics behind XML
    2. Description Logics
    3. RDF / SPARQL
    4. Search vs. Queries
  4. Issues & Technologies
    1. Distributed Databases
      • The CAP Theorem
      • shared nothing
    2. Data Warehouses
      • ETL vs. ELT
      • business intelligence
    3. Integration on-the-fly
    4. Schema Integration
      • ontologies
      • schema integration
    5. XML, XQuery, & the Semantic Web
    6. Big Data / NoSQL
      • MapReduce
      • Hadoop, etc.
  5. Topics
    • to be announced

This outline is tentative, and no doubt will be adjusted during the course. Realistically, we probably cannot cover all of these topics. Furthermore, we shall need to adjust dynamically, based on everyone's interests.

Readings

There is no textbook. There will be a primary paper (journal or conference article or book chapter) assigned for each class session. These will be made available on the class webpage. There will often be a second, recommended paper each time. (The discussion leader and responder will be responsible for covering these.)

The other material will be class notes (as lecture slides), tutorial notes, etc., which will be covered in class and made available on the class webpage.

 
  Grading Criteria / Course Requirements
Components

Component              Percentage      When
Assignments            3 × 5% = 15%    over the term
Reading Summaries      sum to 10%      over the term
Project (group)        20%             middle of term
Presentation           20%             in 2nd half of term
  discussion leader      15%
  responder              5%
Proposal Report        20%             due at end of classes
Final Exam             15%             take-home, due in exam period

The grading policy is standard. Discussion is fine, but your assignment work must be your own.

Class attendance will not be monitored, but is important since the class will involve readings and discussion.

The assignments will be small problem sets based on background material that we cover.

Reading summaries are short, several-paragraph summaries of assigned papers that also address an assigned question about the paper. Ten of these need to be submitted (for 1% each).

Each person will serve as a discussion leader one time for a given topic. He or she will need to read that day's paper (and “recommended” paper) thoroughly. The leader will present a brief summary of the papers with the important issues, results, and open problems. He or she will then lead the discussion. Each person will also serve as a responder one time for a given topic. The responder also will need to read that day's paper (and “recommended” paper) thoroughly. He or she prepares questions for the leader.

The proposal report is to be written as a research proposal; it should summarize a specific topic, identify an open issue, and propose a course of research to address that issue. (The ballpark is that it would be 6–8 pages in VLDB format.)

The project involves implementing something / building a simple prototype, or applying existing information integration tools to a specific problem domain. Projects can be done in teams (up to three people), with the size of the project being commensurate with the size of the team. (The project will receive a single grade.)

Report & project topics are to be negotiated with the instructor.

There will be a take-home final exam, worth 15%, to be done during the exam period. It will have exercises and questions pertinent to the readings and course lectures and materials. It will be “open book”, but, of course, without consultation with others.

York University's rules for academic honesty and plagiarism are always in effect. (See below.) Discussion is fine on the assignments and projects. However, collaboration is not. The work must be your own.

 
  Tentative Schedule

Week#  days          topic
#1     10, 12 Sept   Introduction
#2     17, 19 Sept   Semantics: Datalog
#3     24, 26 Sept   Semantics: Answer Set Programming
#4     1, 3 Oct      Mediation: Query Containment
#5     8, 10 Oct     Mediation: Integration
#6     15, 17 Oct    Other Data Models: Description Logics
#7     22, 24 Oct    Other Data Models: XML
#8     29 Oct        RDF / SPARQL
#9     5, 7 Nov      Distributed Databases
#10    12, 14 Nov    Data Warehouses
#11    19, 21 Nov    Integration on-the-fly
#12    26, 28 Nov    Schema Integration
#13    3, 5 Dec      XML, XQuery, & the Semantic Web
       10–23 Dec     fall exams

 
  Policies
Academic Integrity / Honesty / Plagiarism

The Department of Computer Science (& Engineering) Academic Honesty Guidelines are in effect for this course, as, indeed, they are for any CS&E course.

Plagiarism is defined as taking the language, ideas, or thoughts of another, and representing them as your own. If you use someone else's ideas, cite them. If you use someone else's words, clearly mark them as a quotation. Note that plagiarism includes using another's computer programs or pieces of a program. All noted instances of plagiarism will be reported.

These policies are not intended to keep students from working with other students. One can learn much working with others, so this is to be encouraged. Should you encounter any situations for which you are uncertain whether the collaboration is permitted or not, please ask.