Tutorials

VLDB 2014 has accepted the five tutorials listed below.

Systems for Big Graphs (3 hours)
Arijit Khan, Sameh Elnikety

Uncertain Entity Resolution (1.5 hours)
Avigdor Gal

Knowledge Bases in the Age of Big Data Analytics (3 hours)
Fabian Suchanek, Gerhard Weikum

Causality and Explanations in Databases (1.5 hours)
Alexandra Meliou, Sudeepa Roy, Dan Suciu

Enterprise Search in the Big Data Era (3 hours)
Yunyao Li, Ziyang Liu, Huaiyu Zhu



Systems for Big Graphs

Arijit Khan

Sameh Elnikety

Abstract
Large-scale, highly interconnected networks pervade our society and the natural world around us. Graphs represent such complicated structures and schema-less data, including the World Wide Web, social networks, knowledge graphs, genome and scientific databases, e-commerce, and medical and government records. Graph processing poses interesting system challenges: a graph models entities and their relationships, which are usually irregular and unstructured, and therefore the computation and data access patterns exhibit poor locality. Although several disk-based graph-indexing techniques have been proposed for specific graph operations, they still cannot provide the level of efficient random access required by graph computation. On the other hand, the scale of graph data easily overwhelms the main memory and computation resources on commodity servers. Today's big graphs consist of millions of vertices and billions of edges. In these cases, achieving low latency and high throughput requires partitioning the graph and processing the graph data in parallel across a cluster of servers. However, the software and hardware advances that have worked well for developing parallel databases and scientific applications are not necessarily effective for graph problems. Hence, the last few years have seen an unprecedented interest in building systems for big graphs by various communities, including databases, systems, semantic web, and machine learning. In this tutorial, we discuss the design of these emerging systems for processing big graphs, key features of distributed graph algorithms, and graph partitioning and workload balancing techniques. We discuss the current challenges and highlight some future research directions.

Presenter
Arijit Khan and Sameh Elnikety

Bio
Arijit Khan is a postdoctoral researcher in the Systems group at ETH Zurich. His research interests span big data, big graphs, and graph systems. He received his PhD from the Department of Computer Science, University of California, Santa Barbara. Arijit was the recipient of the prestigious IBM PhD Fellowship in 2012-13. He co-presented a tutorial on emerging queries over linked data at ICDE 2012.

Sameh Elnikety is a researcher at Microsoft Research in Redmond, Washington. He received his Ph.D. from the Swiss Federal Institute of Technology (EPFL) in Lausanne, Switzerland, and his M.S. from Rice University in Houston, Texas. His research interests include distributed server systems and database systems. Sameh’s work on database replication received the best paper award at EuroSys 2007.




Uncertain Entity Resolution

Avigdor Gal

Abstract
Entity resolution is a fundamental problem in data integration, dealing with the combination of data from different sources into a unified view of the data. Entity resolution is inherently an uncertain process because the decision to map a set of records to the same entity cannot be made with certainty unless they are identical in all of their attributes or have a common key. In light of recent advances in the data accumulation, management, and analytics landscape (known as big data), the tutorial re-evaluates the entity resolution process and, in particular, looks at the best ways to handle data veracity. The tutorial ties entity resolution to recent advances in probabilistic database research, focusing on sources of uncertainty in the entity resolution process. We shall discuss which types of uncertainty have been handled in the literature and suggest new methods for coping with various types of uncertainty, some of which are presented as future challenges.

Presenter
Avigdor Gal

Bio
Avigdor Gal is an associate professor at the Faculty of Industrial Engineering & Management, Technion. He is an expert on information systems. His research focuses on effective methods of integrating data from multiple and diverse sources, which affect the way businesses and consumers seek information over the Internet. His current work zeroes in on data integration: the task of providing communication between databases, and connecting such communication to real-world concepts. Another line of research involves the identification of complex events such as flu epidemics, biological attacks, and breaches in computer security, and its application to disaster and crisis management. He has applied his research to European and American projects in government, eHealth, and the integration of business documents. Prof. Gal has published more than 100 papers in leading professional journals (e.g. Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Internet Technology (TOIT), and the VLDB Journal), conferences (ICDE, BPM, DEBS, ER, CoopIS), and books (Schema Matching and Mapping). He authored the book Uncertain Schema Matching in 2011, serves in various editorial capacities for periodicals including the Journal on Data Semantics (JoDS), Encyclopedia of Database Systems and Computing, and has helped organize professional workshops and conferences nearly every year since 1998. He won the IBM Faculty Award each year from 2002 to 2004, several Technion awards for teaching, the 2011-13 Technion-Microsoft Electronic Commerce Research Award, and the 2012 Yanai Award for Excellence in Academic Education, among others.




Knowledge Bases in the Age of Big Data Analytics

Fabian Suchanek

Gerhard Weikum

Abstract
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. They usually contain millions of entities and hundreds of millions of facts about them. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications. Particular emphasis will be on the twofold role of knowledge bases for big-data analytics: using scalable distributed algorithms for harvesting knowledge from Web and text sources, and leveraging entity-centric knowledge for deeper interpretation of and better intelligence with big data.

Presenter
Fabian Suchanek and Gerhard Weikum

Bio
Fabian M. Suchanek is an associate professor at the Télécom ParisTech University in Paris, France. He obtained his PhD at the Max Planck Institute for Informatics in 2008, which earned him an honorable mention for the ACM SIGMOD Jim Gray Dissertation Award. Later, he was a postdoc at Microsoft Research Search Labs in Silicon Valley (in the group of Rakesh Agrawal) and in the WebDam team at INRIA Saclay/France (in the group of Serge Abiteboul), and he led an independent Otto Hahn Research Group, funded by the Max Planck Society. Fabian is the main architect of the YAGO ontology, one of the largest public knowledge bases.

Gerhard Weikum is a scientific director at the Max Planck Institute for Informatics in Saarbruecken, Germany, where he is leading the department on databases and information systems. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the YAGO knowledge base. Gerhard is an ACM Fellow, a member of several scientific academies in Germany and Europe, and a recipient of a Google Focused Research Award, an ACM SIGMOD Contributions Award, and an ERC Synergy Grant.




Causality and Explanations in Databases

Alexandra Meliou

Sudeepa Roy

Dan Suciu

Abstract
With the surge in the availability of information, there is a great demand for tools that assist users in understanding their data. While today's exploration tools rely mostly on data visualization, users often want to go deeper and understand the underlying causes of a particular observation. This tutorial surveys research on causality and explanation for data-oriented applications. Over the last few years there have been several efforts in the database and AI communities to develop general techniques to model causes for observations on data, starting with Judea Pearl's seminal book on 'causality'. Causality has been formalized both for AI applications and for database queries, and formal definitions of 'explanations' have also been proposed in the database literature. In this tutorial we will review and summarize the research thus far into causality and explanation in the database and AI communities, giving researchers a snapshot of the current state of the art on this topic, and propose directions for future research. We will cover both the theory of causality/explanation and some applications. We will also discuss the connections with other topics in database research, such as provenance, deletion propagation, why-not queries, and OLAP techniques. The tutorial is aimed at a broad audience in the database community, including active researchers in data management, graduate students seeking a new research topic, and industry practitioners who want a preview of a plausible future direction for data analysis tools.

Presenter
Alexandra Meliou, Sudeepa Roy and Dan Suciu

Bio
Alexandra Meliou is an Assistant Professor in the School of Computer Science at the University of Massachusetts. She received her PhD and Master's degrees from the University of California, Berkeley in 2009 and 2005, respectively. She completed her postdoctoral work in 2012 at the University of Washington. She has made contributions to the areas of provenance, causality in databases, data cleaning, and sensor networks. She received a 2008 Siebel Scholarship, a 2012 SIGMOD best demo award, and a 2013 Google Faculty Award.

Sudeepa Roy is a Postdoctoral Researcher in Computer Science at the University of Washington. Her current research focuses on the theory and applications of causality/explanations in databases. She has also worked on provenance in databases and workflows, probabilistic databases, information extraction, and crowdsourcing. During her doctoral studies at the University of Pennsylvania, she was a recipient of the Google PhD Fellowship in structured data.

Dan Suciu is a Professor in Computer Science at the University of Washington. He has made contributions to semistructured data, data privacy, probabilistic databases, and causality/explanations in databases. He is a Fellow of the ACM, holds twelve US patents, received the best paper award at SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, and the 10 Year Most Influential Paper Award at ICDE 2013, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. He has given past tutorials at VLDB and SIGMOD (on semistructured data and XML, and on probabilistic databases).




Enterprise Search in the Big Data Era

Yunyao Li

Ziyang Liu

Huaiyu Zhu

Abstract
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously difficult due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high-quality enterprise search engine, in the context of the rise of big data.

Presenter
Yunyao Li, Ziyang Liu and Huaiyu Zhu

Bio
Yunyao Li is a researcher at IBM Research - Almaden. She has broad interests across multiple disciplines, most notably databases, natural language processing, human-computer interaction, information retrieval, and machine learning. Her current research focuses on enterprise search and scalable declarative text analytics for enterprise applications. She is the owner of several key components in the search engine that is currently powering IBM intranet search. She received her PhD degree in Computer Science and Engineering from the University of Michigan, Ann Arbor, in 2007.

Ziyang Liu is a researcher in the Data Management department at NEC Laboratories America. His research interests span several topics in data management, including efficient and iterative big data analytics, data pricing, multitenant databases, data usability, and effectively searching structured data with keywords. He received his Ph.D. from the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University in 2011. He also received a B.S. degree in computer engineering from Harbin Institute of Technology, China, in 2006.

Huaiyu Zhu is with IBM Research - Almaden. He received his PhD degree in Computational Mathematics and Statistics from Liverpool University. His research interests include statistical and machine learning techniques in data mining applications, especially in text analytics and large-scale enterprise applications. In the past several years, his main research focus has been on enterprise search.