VLDB 2002 Tutorial Program

VLDB 2002

28th International Conference on Very Large Data Bases

Kowloon Shangri-La Hotel

August 20-23, 2002

Hong Kong, China

TUTORIAL PROGRAM

Tuesday, August 20

| Tutorial 1 (Sarawagi) | Tutorial 2 (Shasha/Bonnet) |

Wednesday, August 21

| Tutorial 3 (Chakrabarti) | Tutorial 4 (Mohan) |

Thursday, August 22

| Tutorial 5 (Garofalakis/Gehrke/Rastogi) | Tutorial 6 (Nori) |

Friday, August 23

| Tutorial 7 (Faloutsos) |

TUTORIAL 1: TUESDAY, 20 AUGUST 2002, 11:00-13:00

Automation in Information Extraction and Integration

(PDF Presentation Slides - 1.2MB)

Sunita Sarawagi
Indian Institute of Technology - Bombay, India

OBJECTIVES

Data integration has always been a problem of acute importance in applications like data warehousing. The problem is gaining added momentum with the growing popularity of web portals like CiteSeer that lend structure to data gleaned from many different web pages. In this tutorial we will discuss how the novel use of techniques from machine learning and data mining can automate the previously manual processes of information extraction, duplicate elimination, schema mapping and missing value substitution.

CONTENTS

The tutorial will focus on two core operations: information extraction and duplicate elimination. We will show how to automate these via the application of classification methods like rule learning and decision trees, and sequence modeling methods like hidden Markov models. Issues like feature design, choice of models, and order of extraction present interesting design alternatives in such cases. Most automated methods require labeled data, which itself takes manual effort to produce. We will review recent research on exploiting available structured databases and the techniques of semi-supervised and active learning to address the problem of sparse training data.

WHO SHOULD ATTEND

Researchers and professionals involved in data warehouse cleaning, data mining preprocessing tools and Internet portals that integrate web information sources.

ABOUT THE INSTRUCTOR

Sunita Sarawagi does research in the fields of databases, data mining, machine learning and data warehousing. She is a member of the faculty at IIT Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She received her Ph.D. in databases from the University of California at Berkeley.

TUTORIAL 2: TUESDAY, 20 AUGUST 2002, 14:30-16:00 & 16:30-18:00

Database Tuning: Principles, Experiments and Troubleshooting Techniques

(PDF Presentation Slides - 716K)

Dennis Shasha
New York University, U.S.A.

Philippe Bonnet
University of Copenhagen, Denmark

OBJECTIVES

To show that database tuning can be distilled into a set of principles that apply across systems.

To explain some of those principles and give evidence from experiments performed on the major vendor systems.

To show the scope of the database tuning problem: from hardware to transaction design to application considerations to schema design.

CONTENTS

Principles: eliminating start-up costs, partitioning, thinking globally.

Locking/logging: locking granularity, checkpoint tuning.

Hardware: RAID, buffer size, controller cache.

Communication: ODBC vs. native, user defined functions.

Electronic commerce: indexes and communication.

Data warehousing: aggregate targeting and indexes.

Troubleshooting tools: query plan, performance monitors, event monitors.

WHO SHOULD ATTEND

Designers of DBMSs, consultants, advanced application developers, professors.

ABOUT THE INSTRUCTORS

Dennis Shasha is a professor at NYU's Courant Institute where he does research on biological pattern discovery for microarrays, combinatorial pattern matching on trees and graphs, database tuning, and database design and algorithms for time series. He has written or co-written seven books, including some fun ones. He has a monthly column of mathematical puzzles in Scientific American and in Dr. Dobb's Journal.

Philippe Bonnet is an assistant professor in the computer science department of the University of Copenhagen (DIKU), where he does research on database tuning, query processing and data management over sensor networks.

TUTORIAL 3: WEDNESDAY, 21 AUGUST 2002, 11:00-13:00

Text Search for Fine-grained Semi-structured Data

(PDF Presentation Slides - 1MB)

Soumen Chakrabarti
Indian Institute of Technology - Bombay, India

OBJECTIVES

Unlike Web search engines, relational query languages do not facilitate schema-less keyword searches. This tutorial will expose the attendees to recent research results which bridge the gap between these extremes. Attendees will learn about indexing, searching, and ranking techniques for graph-structured data with free-form text in columns or nodes of the data.

CONTENTS

Inverted indices, keyword search, vector space model, relevance ranking; social network analysis, prestige ranking; graph models for relational and semi-structured textual data; query models; responding to a keyword query using a subgraph; ranking nodes, formulations based on Steiner trees and biased random walks; relevance feedback in the graph model; search strategies and performance issues; integrating multiple repositories and metadata; user interface issues; research directions.

WHO SHOULD ATTEND

Researchers and builders of systems for searching relational and semi-structured databases using keywords.

ABOUT THE INSTRUCTOR

Soumen Chakrabarti holds a Ph.D. from U.C. Berkeley. Prior to joining IIT Bombay he worked at IBM Research on crawling, searching and mining the Web using its hyperlink graph structure. He has served as a deputy-chair for WWW 2002 and ICDE 2003, and as a program committee member for many conferences, including VLDB, ICDE, SIGKDD, SIGIR, WWW, and SODA.

TUTORIAL 4: WEDNESDAY, 21 AUGUST 2002, 14:30-16:00 & 16:30-18:00

Application Servers and Associated Technologies

(PDF Presentation Slides - 4.5MB)

C. Mohan
IBM Almaden Research Center, U.S.A.

OBJECTIVES

Application Servers (ASs), which have become very popular in the last few years, provide the platforms for the execution of transactional, server-side applications in the online world. While transaction processing monitors (TPMs) have been providing similar functionality for over three decades, ASs are their modern equivalents. ASs play a central role in enabling electronic commerce in the web context. The objective of this tutorial is to provide an introduction to different ASs and their underlying technologies for the novice as well as the experienced person. The intent is to broaden the background of database people so that they can better appreciate application requirements and scenarios.

CONTENTS

Introduction: TP Monitors, Evolution of Application Environments and Requirements, Distributed Computing Models, Dynamic Web, Business and Presentation Logic Encapsulation

Underlying Technologies: Java 2 Enterprise Edition (J2EE), Common Object Request Broker Architecture (CORBA), Enterprise JavaBeans (EJBs), JavaServer Pages (JSPs), Java Transaction API & Service (JTA & JTS), Java Message Service (JMS), Java Database Connectivity (JDBC), Internet Inter-ORB Protocol (IIOP), Simple Object Access Protocol (SOAP), Web Services

Application Servers: IBM WebSphere, BEA WebLogic, Oracle9i Application Server, Sun ONE Application Server (iPlanet), Microsoft .NET

Functionality Attributes: Availability, Scalability, High Performance, Load Balancing, Embeddability, Portability, Cloning, Failover

Benchmarks: Nile, Trade, ECPerf

Complementary Functionality Areas: Commerce, Business to Business Collaboration, Personalization, Transcoding, Internationalization, Caching, Directory Services, Visual/Integrated Software Development Environments, Transactional Messaging and Queuing, Edge Servers

Application Case Studies: eBay, Schwab

WHO SHOULD ATTEND

This tutorial is targeted at academic/industrial researchers, systems designers/implementers and practitioners who wish to obtain a good understanding of the state-of-the-art in application servers, especially those based on J2EE, and their associated technologies.

ABOUT THE INSTRUCTOR

C. Mohan (Ph.D. 1981, UT-Austin) was named an IBM Fellow in 1997 for being recognized worldwide as a leading innovator in transaction management. He is the primary inventor of the ARIES family of recovery and locking methods, and the industry-standard Presumed Abort commit protocol. He received the 1996 ACM SIGMOD Innovations Award. At the 1999 VLDB Conference, he was honored with the 10 Year Best Paper Award for the widespread commercial and research impact of his work on ARIES. Mohan, who is an inventor on 33 patents, works very closely with numerous IBM product groups. His research results are implemented in numerous IBM and non-IBM prototypes and products like DB2, MQSeries, Lotus Domino, S/390 Parallel Sysplex and SQL Server. Currently, Mohan is a member of IBM's Application Integration Middleware Architecture Board and is working on next generation messaging technologies and database caching in the context of WebSphere and DB2.

TUTORIAL 5: THURSDAY, 22 AUGUST 2002, 11:00-12:30 & 14:00-15:30

Querying and Mining Data Streams: You Only Get One Look

(PDF Presentation Slides - 456K)

Minos Garofalakis
Lucent Technologies - Bell Labs, U.S.A.

Johannes Gehrke
Cornell University, U.S.A.

Rajeev Rastogi
Lucent Technologies - Bell Labs, U.S.A.

OBJECTIVES

Continuous data streams arise naturally, for example, in the network installations of large telecom and Internet service providers, where detailed usage information from different parts of the network needs to be continuously collected and analyzed for interesting trends. This tutorial will provide a comprehensive and clear overview of the key research results to date on data stream processing.

CONTENTS

Our discussion will be structured as follows.

Introduction: Basic stream-processing models and architectures; motivating applications.

Basic Stream Summarization Algorithms: Samples, quantiles/histograms, sketches, wavelets over streaming data.

Processing Queries on Streams: Using sketches for self-joins, binary joins, and complex joins over data streams; estimating correlated aggregates; using histogram and wavelet synopses for approximate-query processing.

Mining High-speed Data Streams: Single-pass algorithms for rule discovery, clustering, and decision-tree construction over streams.

Advanced Topics and Future Research Directions: Hot-list maintenance; distinct-value estimation; multi-dimensional synopses; content-based filtering of streaming XML documents.

WHO SHOULD ATTEND

This tutorial is targeted at researchers and practitioners who want to obtain a solid understanding of the state-of-the-art in stream query processing and analysis.

ABOUT THE INSTRUCTORS

Minos Garofalakis (Ph.D. 1998, UW-Madison) is a Member of Technical Staff at Bell Labs. His research interests include data reduction and mining, data streaming, approximate queries, and XML.

Johannes Gehrke (Ph.D. 1999, UW-Madison) is an Assistant Professor at Cornell University. His research interests include data mining, database systems, and ubiquitous computing.

Rajeev Rastogi (Ph.D. 1993, UT-Austin) is a Department Director at Bell Labs. His research interests include network management, database systems, and knowledge discovery.

TUTORIAL 6: THURSDAY, 22 AUGUST 2002, 14:00-15:30 & 16:00-17:30

eBusiness Architectures and Standards

(PDF Presentation Slides - 1.6MB)

Anil K. Nori, Founder and CTO
Asera Inc., U.S.A.

OBJECTIVES

eBusiness systems and solutions enable enterprises to implement their business processes and turn the enterprises into real-time enterprises, where customers, partners, suppliers and employees share data and processes in real time. The deployment of eBusiness is made real with the advances in the Internet and eBusiness technologies and standards. However, there are numerous standards, sometimes overlapping, often causing confusion amongst developers and users. In this tutorial we will study different eBusiness technologies, standards and their applicability.

CONTENTS

1. History of eBusiness
2. eBusiness Requirements
3. eBusiness Applications
4. eBusiness Architectures
   a. Services Oriented, Process Based architectures
   b. Integration framework
   c. Business objects (e.g., XML schema)
   d. Business Process Management
   e. Access Management
   f. Portal and Presentation
   g. Web Services
5. eBusiness Standards
   a. Business protocols: ebXML, RosettaNet, etc.
   b. XML Schema
   c. Messaging, Web Services: WSDL, SOAP, UDDI
   d. Workflow: BPML, XLANG, WSFL, etc.
   e. Implementation Standards: J2EE vs. .NET
   f. Business Process Design
   g. Security standards (e.g. SAML)
   h. Solutions verticals (e.g. Chemicals, Financials)
6. Case Studies
7. Open issues

WHO SHOULD ATTEND

This tutorial is very practical and systems oriented. It is intended for database/middleware researchers, implementers, application developers and end users who want to gain a comprehensive understanding of eBusiness architectures, technologies (e.g. workflow, rules) and standards, and an appreciation for their applicability in building enterprise solutions.

ABOUT THE INSTRUCTOR

Anil Nori has considerable experience in building complex database and eBusiness systems. He is co-founder and CTO of Asera Inc., which provides eBusiness solutions supported by a platform and tools for development, deployment and management of enterprise business processes. Prior to Asera, Anil worked as a key database server architect at Oracle and Digital Equipment Corporation. He also worked at Computer Corporation of America. Anil has published papers and presented tutorials at SIGMOD, VLDB, ICDE and other industry conferences.

TUTORIAL 7: FRIDAY, 23 AUGUST 2002, 09:00-10:30 & 11:00-13:00

Sensor Data Mining: Similarity Search and Pattern Analysis

(PDF Presentation Slides - 2.5MB)

Christos Faloutsos
Carnegie Mellon University, U.S.A.

OBJECTIVES

How can we find patterns in a sequence of sensor measurements (e.g., a sequence of temperatures or water-pollutant measurements)? How can we compress it? What are the major tools for forecasting and outlier detection? The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in sensor sequences. Sensor data analysis is becoming increasingly important, due to the decreasing cost of hardware and the increasing on-sensor processing abilities. We review the state of the art in three related fields: (a) fast similarity search for time sequences, (b) linear forecasting with the traditional AR (autoregressive) and ARIMA methodologies and (c) non-linear forecasting, for chaotic/self-similar time sequences, using lag-plots and fractals. The tutorial emphasizes the intuition behind these powerful tools, which is usually lost in the technical literature, and gives case studies that illustrate their practical use.

CONTENTS

Similarity Search: why we need similarity search; distance functions (Euclidean, LP norms, time-warping); fast searching (R-trees, M-trees); feature extraction (DFT, Wavelets, SVD, FastMap).

Linear Forecasting: main idea behind linear forecasting; AR methodology; multivariate regression; Recursive Least Squares; de-trending and periodicities.

Non-linear/chaotic forecasting: main idea (lag-plots); 'fractals' and 'fractal dimensions' (definition and intuition, algorithms for fast computation); case studies.

WHO SHOULD ATTEND

Researchers who want to get up to speed with the major tools in time sequence analysis. Also, practitioners who want a concise, intuitive overview of the state of the art.

ABOUT THE INSTRUCTOR

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), three "best paper" awards (SIGMOD 94, VLDB 97, KDD 01 runner-up), and four teaching awards. He is a member of the executive committee of SIGKDD; he has published over 100 refereed articles and one monograph, and holds four patents. His research interests include data mining, fractals, indexing methods for multimedia and text databases, and database performance.