30th International Conference on
Very Large Data Bases
Royal York Hotel
29 August - 3 September 2004
Toronto, Canada






Tutorial 1: Database Architectures for New Hardware

[Presentation PDF] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]

Anastassia Ailamaki, Carnegie Mellon University

Room 4, Tuesday 11:00-12:30 and 2:00-3:30

Thirty years ago, DBMSs stored data on disks and cached recently used data in main memory buffer pools, while designers worried about improving I/O performance and maximizing main memory utilization. Today, however, databases live in multi-level memory hierarchies that include disks, main memories, and several levels of processor caches. Four (often correlated) factors have shifted the performance bottleneck of data-intensive commercial workloads from I/O to the processor and memory subsystem. First, storage systems are becoming faster and more intelligent (disks now come complete with their own processors and caches). Second, modern database storage managers aggressively improve locality through clustering, hide I/O latencies using prefetching, and parallelize disk accesses using data striping. Third, main memories have become much larger and often hold the application's working set. Finally, the growing memory/processor speed gap has heightened the importance of processor caches to database performance.
This tutorial will first survey techniques proposed in the computer architecture and database literature for understanding and evaluating database application performance on modern hardware. We will present approaches and methodologies used to produce time breakdowns when executing database workloads on modern processors. Then, we will survey techniques proposed to alleviate the problem, with major emphasis on data placement and prefetching techniques and their evaluation. Finally, we will discuss open problems and future directions: Is the memory subsystem the only thing database software architects should worry about? How important to database workload behavior are the other decisions processors make? Given the emerging multi-threaded, multi-processor computers with modular, deep cache hierarchies, how feasible is it to create database systems that will adapt to their environment and will automatically take full advantage of the underlying hierarchy?

Anastassia Ailamaki, CMU
Anastassia Ailamaki received a B.Sc. degree in Computer Engineering from the Polytechnic School of the University of Patra, Greece, M.Sc. degrees from the Technical University of Crete, Greece, and from the University of Rochester, NY, and a Ph.D. degree in Computer Science from the University of Wisconsin-Madison. In 2001, she joined the Computer Science Department at Carnegie Mellon University as an Assistant Professor. Her research interests are in the broad area of database systems and applications, with emphasis on database system behavior on modern processor hardware and disks. Her projects at Carnegie Mellon (including Staged Database Systems, Cache-Resident Data Bases, and the Fates Storage Manager) aim at building systems to strengthen the interaction between the database software and the underlying hardware and I/O devices. Her other research interests include automated database design for scientific databases, storage device modeling, and internet querying. She has received three best-paper awards (VLDB 2001, Performance 2002, and ICDE 2004), an NSF CAREER award (2002), and IBM Faculty Partnership awards in 2001, 2002, and 2003. She is a member of IEEE and ACM.

Tutorial 2: Security of Shared Data in Large Systems: State of the Art and Research Directions

[Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]

Arnon Rosenthal and Marianne Winslett

Room 4, Tuesday 4:00-5:30 and Wednesday 11:00-12:30

Security is increasingly recognized as a key impediment to sharing data in enterprise systems, virtual enterprises, and the semantic web. Yet the topic has not been a focus for mainstream database research, industrial progress in data security has been slow, and (too) much security enforcement is in application code, or else is coarse grained and insensitive to data contents. Today the database research community is in an excellent position to make significant improvements in the way people think about security policies, due to the community's experience with declarative and logic-based specifications, automated compilation and physical design, and both semantic and efficiency issues for federated systems. These strengths provide a foundation for improving both theory and practice. This tutorial aims to enlighten the VLDB community about the state of the art in data security, especially for enterprise or larger systems, and to engage the community's interest by showing how database researchers can help improve it.

Arnie Rosenthal, MITRE
Arnie Rosenthal is a Principal Scientist at MITRE. He has broad interests in problems that arise when data is shared between communities, including a long-standing interest in the security issues that arise in data warehouses, federated databases, and enterprise information systems. He has also had a first-hand look at many security problems that arise in large government and military organizations.

Marianne Winslett, University of Illinois
Marianne Winslett has been a professor at the University of Illinois since 1987. She started working on database security issues in the early 1990s, focusing on semantic issues in MLS databases. Her interests soon shifted to issues of trust management for data on the web. Trust negotiation is her main current research focus.

Tutorial 3: Self-Managing Technology in Database Management Systems

Part 1: [Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]
Part 2: [Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]
Part 3: [Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]
Part 4: [Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]

Surajit Chaudhuri, Benoit Dageville, Guy Lohman

Room 4, Wednesday 2:00-4:00 and 4:30-5:30

The rising cost of labor relative to hardware and software means that the total cost of ownership of information technology is increasingly dominated by people costs. In response, all major providers of information technology are attempting to make their products more self-managing. In this tutorial, we will introduce the core concepts of self-managing technology and discuss their implications for database management systems. We will review the state of the art of self-managing technology for the IBM DB2, Microsoft SQL Server, and Oracle products, and describe the wealth of research topics remaining to be solved.

Surajit Chaudhuri, Microsoft Research
Surajit Chaudhuri leads the Data Management and Exploration Group at Microsoft Research (http://research.microsoft.com/dmx). In 1996, Surajit started the AutoAdmin project at Microsoft Research to address the challenges in building self-tuning database systems. This project developed novel automated index tuning technology that shipped with SQL Server 7.0 in 1998. The AutoAdmin project has since continued to develop self-tuning and self-manageability technology further in collaboration with the Microsoft SQL Server product team. Surajit's other project is Data Exploration, which looks at the problem of querying and discovering, in a flexible manner, information that spans text as well as relational data. Surajit is also interested in the problems of data cleaning and integration. Surajit received his Ph.D. from Stanford University in 1991 and worked at Hewlett-Packard Laboratories, Palo Alto, from 1991 to 1995. He has published widely in major database conferences. http://research.microsoft.com/users/surajitc

Benoit Dageville, Oracle Corp.
Benoit Dageville is a consulting member in the Oracle Database Server Technologies division at Oracle Corporation in Redwood Shores, California. His main areas of expertise include parallel query processing, ETL (Extract, Transform, and Load), large Data Warehouse benchmarks (e.g. TPC-H), and SQL execution and optimization. Since 1999, he has been one of the lead architects of the self-managing database initiative at Oracle. Major features resulting from this effort include the Automatic SQL Memory Management (Oracle9i), Automatic Workload Repository, Automatic Database Diagnostic Monitor and Automatic SQL Tuning (Oracle10g). Dr. Benoit Dageville graduated from the University of Paris VI, France, in 1995 with a Ph.D. degree in computer science, specializing in Parallel Database Management Systems under the supervision of Dr. Patrick Valduriez. His research and industrial work resulted in several refereed papers in international conferences and journals.

Guy M. Lohman, IBM Almaden
Guy M. Lohman is Manager of Advanced Optimization in the Advanced Database Solutions Department at IBM Research Division's Almaden Research Center in San Jose, California, and has 22 years of experience in relational query optimization at IBM. He is the architect of the Optimizer of the DB2 Universal Data Base (UDB) for Linux, Unix, and Windows, and was responsible for its development from 1992 to 1997. Prior to that, he was responsible for the optimizers of the Starburst extensible object-relational and R* distributed prototype DBMSs. More recently, Dr. Lohman co-invented and designed the DB2 Index Advisor (predecessor to today's Design Advisor), and in 2000 co-founded the DB2 Autonomic Computing project (formerly known as SMART -- Self-Managing And Resource Tuning), now part of IBM's company-wide Autonomic Computing initiative. In 2002, Dr. Lohman was elected to the IBM Academy of Technology. Dr. Lohman received his Ph.D. in Operations Research in 1976 from Cornell University. He is the author of over 40 papers in the refereed literature and the holder of 13 U.S. patents. His current research interests involve query optimization and self-managing database systems.

Tutorial 4: Architectures and Algorithms for Internet-Scale (P2P) Data Management

[Presentation PPS] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]

Joseph M. Hellerstein

Room 4, Thursday 2:00-3:30 and Thursday 4:00-5:30

The database community prides itself on scalable data management solutions. In recent years, a new set of scalability challenges has arisen in the context of peer-to-peer (p2p) systems on the Internet, in which the scaling metric is the number of participating computers, rather than the number of bytes stored. The best-known application of p2p technology to date has been file sharing, but there are compelling new application agendas in Internet monitoring, content distribution, distributed storage, multi-user games and next-generation Internet routing. The energy behind p2p technology has led to a renaissance in the distributed algorithms and distributed systems communities, much of which directly addresses issues in massively distributed data management. Moreover, many of these ideas have applicability beyond the "pure p2p" context of Internet end-user systems, with ramifications for any large-scale distributed system in which the scale of the system makes traditional administrative models untenable. Internet-scale systems present numerous unique technical challenges, including steady-state "churn" (nodes joining and leaving), the need to evolve and scale without reconfiguration, an absence of ongoing system administration, and adversarial participants in the processing. In this tutorial, we will focus on key data management building blocks including network indirection architectures, persistence models, network embeddings of computations, resource management, and security/trust challenges. We will also discuss motivations for the use of these technologies. We will ground the presentation in experiences from a set of deployed systems, and present open challenges that have arisen in this context.

Joseph M. Hellerstein, University of California, Berkeley and Intel Research Berkeley
Joseph M. Hellerstein is a Professor of Computer Science at the University of California, Berkeley, and is the Director of Intel Research, Berkeley. He is an Alfred P. Sloan Research Fellow, and a recipient of multiple awards, including ACM-SIGMOD's "Test of Time" award for his first published paper, NSF CAREER, NASA New Investigator, Okawa Foundation Fellowship, and IBM's Best Paper in Computer Science. In 1999, MIT's Technology Review named him one of the top 100 young technology innovators worldwide in their inaugural "TR100" list. Hellerstein's research focuses on data management and movement, including database systems, sensor networks, peer-to-peer and distributed systems. Prior to his position at Intel Research, Hellerstein was a co-founder of Cohera Corporation (now part of PeopleSoft), where he served as Chief Scientist from 1998 to 2001. Key ideas from his research have been incorporated into commercial and open-source database systems including IBM's DB2 and Informix, PeopleSoft's Catalog Management, and the open-source PostgreSQL system. Hellerstein currently serves on the technical advisory boards of a number of software companies, and has served as a member of the advisory boards of ACM SIGMOD and Ars Digita University. Hellerstein received his Ph.D. at the University of Wisconsin, Madison, a master's degree from UC Berkeley, and a bachelor's degree from Harvard College. He spent a pre-doctoral internship at IBM Almaden Research Center, and a post-doctoral internship at the Hebrew University in Jerusalem.

Tutorial 5: The Continued Saga of DB-IR Integration

[Presentation PDF] [Handout (2 slides/page)] [Notes (3 slides/page with space for note)]

Ricardo Baeza-Yates and Mariano Consens

Confederation 5 & 6, Friday 9:00-10:30 and 11:00-12:30

The world of data has been developed from two main points of view: the structured relational data model and the unstructured text model. The two distinct cultures of database and information retrieval now have a natural meeting place in the Web with its semi-structured XML model. As web-style searching becomes the ubiquitous tool, the need for integrating these two viewpoints becomes even more important. In this tutorial we explore the differences, the problems and the techniques for DB-IR integration for a range of applications. The tutorial will provide an overview of the different approaches put forward by the IR and DB communities and survey the DB-IR integration efforts. Both earlier proposals as well as recent ones (in the context of XML in particular) will be discussed. A variety of application scenarios for DB-IR integration will be covered. The objective of this tutorial is to provide an overview of the issues and approaches developed for the integration of database and information retrieval systems. The target audience of this tutorial includes researchers in database systems, as well as developers of Web and database/information retrieval applications.

Ricardo Baeza-Yates, University of Chile
Ricardo Baeza-Yates is a professor at the Computer Science Department of the University of Chile, where he was chair from 1993 to 1995. He is also the director of the Center for Web Research, a project funded by the Millennium Scientific Initiative of the Ministry of Planning. He received his bachelor's degree in Computer Science (CS) in 1983 from the University of Chile. Later, in 1985, he also received the M.Sc. in computer science and the professional title in electrical engineering (EE). One year later, he obtained an M.Eng. in EE from the same university. He received his Ph.D. in CS from the University of Waterloo, Canada, in 1989, and held a six-month post-doctoral position the same year. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among other publications in journals published by ACM, IEEE, SIAM, etc. Ricardo received, in 1993, the Organization of American States award for young researchers in exact sciences. In 1994 he received the award for the best engineering research of the previous four years from the Institute of Engineers of Chile, and was invited by the U.S.A. Presidential Office for a one-month scientific tour in that country. In 1996, he won a scholarship from the Education and Science Ministry of Spain to spend a sabbatical year at the Polytechnic Univ. of Catalunya. In 1997, with two Brazilian colleagues, he won the COMPAQ prize for the best Brazilian research article in computer science. As a Ph.D. student he won several scholarships including the Ontario Graduate scholarship, the Institute for Computer Research scholarship for graduate studies, the Information Technology and Research Centre graduate scholarship, the Univ. of Waterloo graduate student award, and the Department of Computer Science Fellowship. Ricardo's professional activities include Presidency of the Chilean Computer Science Society (SCCC) from 1992 to 1995 and from 1997 to 1998. From 1998 to 2000 he was in charge of the IEEE-CS chapter in Chile, and he has been involved in the South American ACM Programming Contest since 1998. He is currently the president of CLEI, a Latin American association of CS departments, and coordinates the Iberoamerican cooperation program in Electronics and Informatics. He was recently elected to the IEEE CS Board of Governors for the period 2002-04. In 2002 he was appointed to the Chilean Academy of Sciences, being the first person from computer science to achieve this position in Chile.

Mariano Consens, University of Toronto
Mariano Consens is a faculty member in Information Engineering at the MIE Department, University of Toronto, which he joined in 2003. Before that, he was research faculty at the School of Computer Science, University of Waterloo, from 1994 to 1999. He received his PhD and MSc degrees in Computer Science from the University of Toronto. He also holds a Computer Systems Engineer degree from the Universidad de la Republica, Uruguay. Mariano's research interests are in the areas of Data Management Systems and the Web, with a current focus on XML searching, autonomic systems and pervasive computing. He has over 20 publications and two patents, including journal publications selected from best conference papers. In addition to his academic positions, he has been active in the software industry as a founder and Director of Freedom Intelligence (a query engine provider), as the CTO of Classwave Wireless (a Bluetooth software infrastructure company) and as a technology advisor for Xign (an electronic payment systems supplier), OpenText (an early web search engine turned knowledge management software vendor) and others.