VLDB 2019: Tutorials
Data Lake Management: Challenges and Opportunities
Fatemeh Nargesian (University of Toronto), Erkang Zhu (University of Toronto), Renée J. Miller (Northeastern University), Ken Pu (UOIT), and Patricia C. Arocena (TD Bank Group)
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, including dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.
Combating Fake News: A Data Management and Mining Perspective
Laks V.S. Lakshmanan (The University of British Columbia), Michael Simpson (The University of British Columbia), and Saravanan Thirumuruganathan (QCRI, HBKU)
Fake news is a major threat to global democracy, resulting in diminished trust in government, journalism, and civil society. The popularity of social media and social networks has caused a contagion of fake news in which conspiracy theories, disinformation, and extreme views flourish. Detection and mitigation of fake news is one of the fundamental problems of our times and has attracted widespread attention. While fact-checking websites such as Snopes and PolitiFact and major companies such as Google, Facebook, and Twitter have taken preliminary steps towards addressing fake news, much more remains to be done. As an interdisciplinary topic, various facets of fake news have been studied by communities as diverse as machine learning, databases, journalism, political science, and many more. The objective of this tutorial is twofold. First, we wish to familiarize the database community with the efforts of other communities on combating fake news. We provide a panoramic view of the state of the art of research on various aspects of fake news, including detection, propagation, mitigation, and intervention. Next, we provide a concise and intuitive summary of prior research by the database community and discuss how it could be used to counteract fake news. The tutorial covers research from areas such as data integration, truth discovery and fusion, probabilistic databases, knowledge graphs, and crowdsourcing through the lens of fake news. Effective tools for addressing fake news can only be built by leveraging the synergistic relationship between the database community and other research communities. We hope that our tutorial provides an impetus towards such a synthesis of ideas and the creation of new ones.
Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
Jiaheng Lu (University of Helsinki), Yuxing Chen (University of Helsinki), Herodotos Herodotou (Cyprus University of Technology), and Shivnath Babu (Duke University)
Database and big data analytics systems such as Hadoop and Spark have a large number of configuration parameters that control memory distribution, I/O optimization, parallelism, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. In this tutorial, we review existing approaches to automatic parameter tuning for databases, Hadoop, and Spark, which we classify into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We describe the foundations of the different automatic parameter tuning algorithms and present the pros and cons of each approach.
The Ever Evolving Online Labor Market: Overview, Challenges and Opportunities
Sihem Amer-Yahia (CNRS, Univ. Grenoble Alpes) and Senjuti Basu Roy (NJIT)
The goal of this tutorial is to make the audience aware of the various discipline-specific research activities that can be characterized as part of online labor markets, and to advocate for a unified framework that is interdisciplinary in nature and requires the convergence of different research disciplines. We will discuss how such a framework could have a transformative effect on the nexus of humans, technology, and the future of work.
Machine Learning Meets Big Spatial Data
Ibrahim Sabek (University of Minnesota) and Mohamed F. Mokbel (Qatar Computing Research Institute, Hamad bin Khalifa University)
The growing volume of generated data has propelled the rise of scalable machine learning solutions that efficiently analyze such data and extract useful insights from it. Meanwhile, spatial data such as GPS data has become ubiquitous and has grown rapidly in size in recent years. The applications of big spatial data span a wide spectrum of interests, including tracking infectious disease, climate change simulation, and drug addiction, among others. Consequently, major research efforts have been devoted to supporting efficient analysis and intelligence within these applications, either by providing spatial extensions to existing machine learning solutions or by building new solutions from scratch. In this 90-minute tutorial, we comprehensively review the state-of-the-art work at the intersection of machine learning and big spatial data. We cover existing research efforts and challenges in three major areas of machine learning, namely data analysis, deep learning, and statistical inference, as well as two advanced spatial machine learning tasks, namely spatial feature extraction and spatial sampling. We also highlight open problems and challenges for future research in this area.
TextCube: Automated Construction and Multidimensional Exploration
Yu Meng (University of Illinois at Urbana-Champaign), Jiaxin Huang (University of Illinois at Urbana-Champaign), Jingbo Shang (University of Illinois at Urbana-Champaign), and Jiawei Han (University of Illinois at Urbana-Champaign)
Today's society is immersed in a wealth of text data, ranging from news articles and social media to research literature, medical records, and corporate reports. A grand challenge of data science and engineering is to develop effective and scalable methods to extract structures and knowledge from massive text data to satisfy diverse applications, without extensive, corpus-specific human annotations. In this tutorial, we show that TextCube provides a critical information organization structure that satisfies such an information need. We give an overview of a set of recently developed data-driven methods that facilitate automated construction of TextCubes from massive, domain-specific text corpora, and show that TextCubes so constructed enhance text exploration and analysis for various applications. We focus on new TextCube construction methods that are scalable, weakly supervised, domain-independent, language-agnostic, and effective (i.e., generating quality TextCubes from large corpora of various domains). We will demonstrate with real datasets (including news articles, scientific publications, and product reviews) how TextCubes can be constructed to assist multidimensional analysis of massive text corpora.
Personal Database Security and Trusted Execution Environments: A Tutorial at the Crossroads
Nicolas Anciaux (Inria Saclay, U. Versailles Saint-Quentin, Université Paris-Saclay), Luc Bouganim (Inria Saclay, U. Versailles Saint-Quentin, Université Paris-Saclay), Philippe Pucheral (U. Versailles Saint-Quentin, Inria Saclay, Université Paris-Saclay), Iulian Sandu Popa (U. Versailles Saint-Quentin, Inria Saclay, Université Paris-Saclay), and Guillaume Scerri (U. Versailles Saint-Quentin, Inria Saclay, Université Paris-Saclay)
Smart disclosure initiatives and new regulations such as GDPR in the EU are increasing interest in Personal Data Management Systems (PDMS), which are provided to individuals to preserve their entire digital life. Consequently, the thorny issue of data security is becoming more and more prominent, yet it differs markedly from traditional privacy issues in outsourced corporate databases. Concurrently, the emergence of Trusted Execution Environments (TEE) changes the game in privacy-preserving data management by enabling novel security models. This tutorial offers a global perspective on the current state of work at the confluence of these two rapidly growing areas. The goal is threefold: (1) review and categorize PDMS solutions and identify existing privacy threats and countermeasures; (2) review new security models capitalizing on TEEs and related privacy-preserving data management solutions relevant to the personal context; (3) discuss new challenges at the intersection of PDMS security and TEE-based data management.