SYLLABUS for CSE-6490B, Issues in Information Integration, Fall 2008, York University

CSE-6490B
Issues in Information Integration
York University
Fall 2008

Syllabus

Instructor:	Parke Godfrey
Office:	#2050 CSE
Office Hours:	Mon 1-3pm, & by appointment / availability
Ph#:	416-736-2100 x66671
e-mail:	godfrey@cse.yorku.ca

Term:	Fall 2008
Time:	We 2:30-5:30pm
Place:	CB #129

The Course

The Topic

Database technology has been quite successful for organizing and storing very large amounts of information, and providing efficient access to this information via powerful query languages. A database management system also helps to protect the integrity of its data.

Unfortunately, the information we might need often is not located at a single source, in a single database, or we do not know in which source the information resides. The answer to a query may need to compose information from various sources. This is called information integration.

Information integration technology and research are becoming increasingly vital. Companies maintain many databases and information sources. They have operating tasks that require various degrees of integration. Governments need integration for everything from security to offering better services. The world wide web can be viewed as a vast collection of information sources from which we would like to extract information collectively.

Course Objectives and Content

In this course, we shall study topics, research, and developments in information integration. It will be conducted much as a readings course. Each class, we shall cover and discuss one key paper and its topic, and additional related papers (recommended reading).

We start from a formal perspective, for two reasons: a formal foundation helps to elucidate the problems and potential solutions, and it is needed to understand much of the research in information integration.

Tentative topics to be covered are as follows.

Introduction
1. What is the problem?
  - The center cannot hold.
  - Information is distributed over different sources. It is not all in one place under one schema.
2. What are solutions?
Background
1. Datalog: A Primer
  1. the data model
  2. evaluation strategies (top-down and bottom-up)
  3. negation, open and closed world assumptions
  4. answer set semantics
2. SQL-3
  - recursion
Mediation
1. Query Answering over distributed data sources [semantics]
  1. query rewrite
  2. query containment and query folding
  3. semantic query caching
2. Schema and Data Integration and Mapping [frameworks]
  1. mediators and wrappers
    - global-as-view vs. local-as-view
  2. schema mapping and integration
    - source-to-target dependencies
    - schematic discrepancies
  3. peer to peer approaches
  4. conflict resolution, repairs, and consistent query answers
    - management of uncertain and disjunctive information
    - cooperative answering systems
3. Query Optimization & Caching for Mediators [implementation]
  1. semantic query optimization
  2. dynamic query processing
Data Models and Semi-structured Data
1. XML
  1. XML data model
  2. XPath and XQuery
  3. XML integrity constraints
  4. XML query processing
  5. description logics behind XML
2. storing semi-structured data in relational DBMSs
3. search versus queries
Ontologies and Semantic Web
- RDF
- web ontology
- OWL
Application Domains
1. bio-informatics
2. emergency and rescue
3. ...

This is tentative and no doubt will adjust during the course. In reality, we probably cannot cover all those topics. Furthermore, we shall need to adjust dynamically, based on everyone's interests.

Readings

There is no textbook. There will be a primary paper (journal or conference article or book chapter) assigned for each class session. These will be made available on the class webpage. There will often be a second, recommended paper each time. (The discussion leader and responder will be responsible for covering these.)

The other material will be class notes (as lecture slides), tutorial notes, etc., which will be covered in class and made available on the class webpage.

Grading Criteria / Course Requirements

Components

Component	Percentage	When
Assignments	3 × 10% = 30%	over the term
Classtime Activities	25%	over the term
  - discussion leader	10%
  - responder	10%
  - participation	5%
Report or Project	25%	due at end of classes
Final Exam	20%	take-home, due in exam period

The grading policy is standard. Discussion is fine, but your assignment work must be your own.

Class attendance will not be monitored, but is important since the class will involve readings and discussion. Each person will serve as a discussion leader one time for a given topic. He or she will need to read that day's paper (and "recommended" paper) thoroughly. The leader will present a brief summary of the papers with the important issues, results, and open problems. He or she will then lead the discussion. Each person will also serve as a responder one time for a given topic. The responder also will need to read that day's paper (and "recommended" paper) thoroughly. He or she prepares questions for the leader.

Each student may choose to do either a term report or a term project. A term report will be like a technical report and will summarize a specific topic. (The ballpark is that it would be 12 to 15 pages.) A term project would involve implementing something / building a prototype, or applying existing information integration tools to a specific problem domain. Term projects can be done in teams (up to two people), of course with the size of the project being commensurate. A team project will receive a single grade. Report / project topics are to be negotiated with the instructor.

There will be a take-home final exam worth 20%, to be done during the exam period. It will have exercises and questions pertinent to the readings and course lectures and materials. It will be open "book", but, of course, without consultation with others.

York University's rules for academic honesty and plagiarism are always in effect. (See below.) Discussion is fine on the assignments and projects. However, collaboration is not. The work must be your own.

Tentative Schedule

Week#	day	topic
#1	3 Sept	Introduction & Background
#2	10 Sept	Logic & Non-monotonicity
#3	17 Sept	rewrite, folding, & containment
#4	24 Sept	mediation, GAV, & LAV
	1 Oct	no class (Rosh Hashanah)
#5	8 Oct	schema mapping & peer-to-peer
#6	15 Oct	query optimization for mediators
#7	22 Oct	query caching for mediators
#8	29 Oct	semi-structured data
#9	5 Nov	XML
#10	12 Nov	XPath & XQuery
#11	19 Nov	search versus query
#12	26 Nov	semantic web & ontologies / applications

Policies

Academic Integrity / Honesty / Plagiarism

The Department of Computer Science (& Engineering) Academic Honesty Guidelines are in effect for this course, as, indeed, they are for any CS&E course.

Plagiarism is defined as taking the language, ideas, or thoughts of another, and representing them as your own. If you use someone else's ideas, cite them. If you use someone else's words, clearly mark them as a quotation. Note that plagiarism includes using another's computer programs or pieces of a program. All noted instances of plagiarism will be reported.

These policies are not intended to keep students from working with other students. One can learn much working with others, so this is to be encouraged. Should you encounter any situation in which you are uncertain whether collaboration is permitted, please ask.