VLDB 2019: Industrial Track Papers
The number assigned to each industrial track paper identifies its poster stand. For example, with industrial track paper 1.1, look for the stand labeled Ind:1.1.
QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning Guoliang Li (Tsinghua University), Xuanhe Zhou (Tsinghua University), Shifu Li (Huawei Company), and Bo Gao (Huawei Company) Database knob tuning is important to achieve high performance (e.g., high throughput and low latency). However, knob tuning is an NP-hard problem and existing methods have several limitations. First, DBAs cannot tune a lot of database instances on different environments (e.g., different database vendors). Second, traditional machine-learning methods either cannot find good configurations or rely on a lot of high-quality training examples which are rather hard to obtain. Third, they only support coarse-grained tuning (e.g., workload-level tuning) but cannot provide fine-grained tuning (e.g., query-level tuning). To address these problems, we propose a query-aware database tuning system QTune with a deep reinforcement learning (DRL) model, which can efficiently and effectively tune the database configurations. QTune first featurizes the SQL queries by considering rich features of the SQL queries. Then QTune feeds the query features into the DRL model to choose suitable configurations. We propose a Double-State Deep Deterministic Policy Gradient (DS-DDPG) model to enable query-aware database configuration tuning, which utilizes the actor-critic networks to tune the database configurations based on both the query vector and database states. QTune provides three database tuning granularities: query-level, workload-level, and cluster-level tuning. We deployed our techniques onto three real database systems, and experimental results show that QTune achieves high performance and outperforms the state-of-the-art tuning methods.
Smile: A System to Support Machine Learning on EEG Data at Scale Lei Cao (Massachusetts Institute of Technology), Wenbo Tao (Massachusetts Institute of Technology), Sungtae An (Georgia Institute of Technology), Jing Jin (Massachusetts General Hospital), Yizhou Yan (Massachusetts Institute of Technology), Xiaoyu Liu (Massachusetts Institute of Technology), Wendong Ge (Massachusetts General Hospital), Adam Sah (Massachusetts Institute of Technology), Leilani Battle (University of Maryland), Jimeng Sun (Georgia Institute of Technology), Remco Chang (Tufts University), Brandon Westover (Massachusetts General Hospital), Samuel Madden (Massachusetts Institute of Technology), and Michael Stonebraker (Massachusetts Institute of Technology) In order to reduce the possibility of neural injury from seizures and sidestep the need for a neurologist to spend hours on manually reviewing the EEG recording, it is critical to automatically detect and classify "interictal-ictal continuum" (IIC) patterns from EEG data. However, the existing IIC classification techniques are shown to be not accurate and robust enough for clinical use because of the lack of high quality labels of EEG segments as training data. Obtaining high-quality labeled data is traditionally a manual process by trained clinicians that can be tedious, time-consuming, and errorprone. In this work, we propose Smile, an industrial scale system that provides an end-to-end solution to the IIC pattern classification problem. The core components of Smile include a visualizationbased time series labeling module and a deep-learning based active learning module. The labeling module enables the users to explore and label 350 million EEG segments (30TB) at interactive speed. The multiple coordinated views allow the users to examine the EEG signals from both time domain and frequency domain simultaneously. The active learning module first trains a deep neural network that automatically extracts both the local features with respect to each segment itself and the long term dynamics of the EEG signals to classify IIC patterns. Then leveraging the output of the deep learning model, the EEG segments that can best improve the model are selected and prompted to clinicians to label. This process is iterated until the clinicians and the models show high degree of agreement. Our initial experimental results show that our Smile system allows the clinicians to label the EEG segments at will with a response time below 500 ms. The accuracy of the model is progressively improved as more and more high quality labels are acquired over time.
Guided automated learning for query workload re-optimization Guilherme Damasio (Ontario Tech University, IBM Centre for Advanced Studies), Vincent Corvinelli (IBM Ltd), Parke Godfrey (York University, IBM Centre for Advanced Studies), Piotr Mierzejewski (IBM Ltd), Alex Mihaylov (Ontario Tech University, IBM Centre for Advanced Studies), Jaroslaw Szlichta (Ontario Tech University, IBM Centre for Advanced Studies), and Calisto Zuzarte (IBM Ltd) Query optimization is a hallmark of database systems. When an SQL query runs more expensively than is viable or warranted, determination of the performance issues is usually performed manually in consultation with experts through the analysis of query's execution plan (QEP). However, this is an excessively time consuming, human error-prone, and costly process. GALO is a novel system that automates this process. The tool automatically learns recurring problem patterns in query plans over workloads in an offline learning phase, to build a knowledge base of plan-rewrite remedies. It then uses the knowledge base online to re-optimize queries often quite drastically. GALO's knowledge base is built on RDF and SPARQL, W3C graph database standards, which is well suited for manipulating and querying over SQL query plans, which are graphs themselves. GALO acts as a third-tier of re-optimization, after query rewrite and cost-based optimization, as a query plan rewrite. For generality, the context of knowledge base problem patterns, including table and column names, is abstracted with canonical symbol labels. Since the knowledge base is not tied to the context of supplied QEPs, table and column names are matched automatically during the re-optimization phase. Thus, problem patterns learned over a particular query workload can be applied in other query workloads. GALO's knowledge base is also an invaluable tool for database experts to debug query performance issues by tracking to known issues and solutions as well as refining the optimizer with new tuned techniques by the development team. We demonstrate an experimental study of the effectiveness of our techniques over synthetic TPC-DS and real IBM client query workloads.
AliGraph: A Comprehensive Graph Neural Network Platform Rong Zhu (Alibaba Group), Kun Zhao (Alibaba Group), Hongxia Yang (Alibaba Group), Wei Lin (Alibaba Group), Chang Zhou (Alibaba Group), Baole Ai (Alibaba Group), Yong Li (Alibaba Group), and Jingren Zhou (Alibaba Group) An increasing number of machine learning tasks require dealing with large graph datasets, which capture rich and complex relationship among potentially billions of elements. Graph Neural Network (GNN) becomes an effective way to address the graph learning problem by converting the graph data into a low dimensional space while keeping both the structural and property information to the maximum extent and constructing a neural network for training and referencing. However, it is challenging to provide an efficient graph storage and computation capabilities to facilitate GNN training and enable development of new GNN algorithms. In this paper, we present a comprehensive graph neural network system, namely AliGraph, which consists of distributed graph storage, optimized sampling operators and runtime to efficiently support not only existing popular GNNs but also a series of in-house developed ones for different scenarios. The system is currently deployed at Alibaba to support a variety of business scenarios, including product recommendation and personalized search at Alibaba's E-Commerce platform. By conducting extensive experiments on a real-world dataset with 492.90 million vertices, 6.82 billion edges and rich attributes, AliGraph performs an order of magnitude faster in terms of graph building (5 minutes vs hours reported from the state-of-the-art PowerGraph platform). At training, AliGraph runs 40%-50% faster with the novel caching strategy and demonstrates around 12 times speed up with the improved runtime. In addition, our in-house developed GNN models all showcase their statistically significant superiorities in terms of both effectiveness and efficiency (e.g., 4.12%--17.19% lift by F1 scores).
Updating Graph Databases with Cypher Alastair Green (Neo4j), Paolo Guagliardo (University of Edinburgh), Leonid Libkin (University of Edinburgh), Tobias Lindaaker (Neo4j), Victor Marsault (LIGM, UPEM/ESIEE-Paris/ENPC/CNRS), Stefan Plantikow (Neo4j), Martin Schuster (Abbott Informatics), Petra Selmer (Neo4j), and Hannes Voigt (Neo4j) The paper describes the present and the future of graph updates in Cypher, the language of the Neo4j property graph database and several other products. Update features include those with clear analogs in relational databases, as well as those that do not correspond to any relational operators. Moreover, unlike SQL, Cypher updates can be arbitrarily intertwined with querying clauses. After presenting the current state of update features, we point out their shortcomings, most notably violations of atomicity and nondeterministic behavior of updates. These have not been previously known in the Cypher community. We then describe the industry-academia collaboration on designing a revised set of Cypher update operations. Based on discovered shortcomings of update features, a number of possible solutions were devised. They were presented to key Cypher users, who were given the opportunity to comment on how update features are used in real life, and on their preferences for proposed fixes. As the result of the consultation, a new set of update operations for Cypher were designed. Those led to a streamlined syntax, and eliminated the unexpected and problematic behavior that original Cypher updates exhibited.
A Lightweight and Efficient Temporal Database Management System in TDSQL Wei Lu (Renmin University of China), Zhanhao Zhao (Renmin University of China), Xiaoyu Wang (Tencent Inc.), Haixiang Li (Tencent Inc.), Zhenmiao Zhang (Renmin University of China), Zhiyu Shui (Renmin University of China), Sheng Ye (Tencent Inc.), Anqun Pan (Tencent Inc.), and Xiaoyong Du (Renmin University of China) Driven by the recent adoption of temporal expressions into SQL:2011, extensions of temporal support in conventional database management systems (a.b.a. DBMSs) have re-emerged as a research hotspot. In this paper, we present a lightweight yet efficient built-in temporal implementation in Tencent's distributed database management system, namely TDSQL. The novelty of TDSQL's temporal implementation includes: (1) a new temporal data model with the extension of SQL:2011, (2) a built-in temporal implementation with various optimizations, which are also applicable to other DBMSs, and (3) a low-storage-consumption in which only data changes are maintained. For the repeatability purpose, we elaborate the integration of our proposed techniques into MySQL. We conduct extensive experiments on both real-life dataset and synthetic TPC benchmarks by comparing TDSQL with other temporal databases. The results show that TDSQL is lightweight and efficient.
TitAnt: Online Real-time Transaction Fraud Detection in Ant Financial Shaosheng Cao (Ant Financial Services Group), XinXing Yang (Ant Financial Services Group), Cen Chen (Ant Financial Services Group), Jun Zhou (Ant Financial Services Group), Xiaolong Li (Ant Financial Services Group), and Yuan Qi (Ant Financial Services Group) With the explosive growth of e-commerce and the booming of e-payment, detecting online transaction fraud in real time has become increasingly important to Fintech business. To tackle this problem, we introduce the TitAnt, a transaction fraud detection system deployed in Ant Financial, one of the largest Fintech companies in the world. The system is able to predict online real-time transaction fraud in mere milliseconds. We present the problem definition, feature extraction, detection methods, implementation and deployment of the system, as well as empirical effectiveness. Extensive experiments have been conducted on large real-world transaction data to show the effectiveness and the efficiency of the proposed system.
A Distributed System for Large-scale n-gram Language Models at Tencent Qiang Long (Tencent), Wei Wang (National University of Singapore), Jinfu Deng (Tencent), Song Liu (Tencent), Wenhao Huang (Tencent), Fangying Chen (Tencent), and Sifan Liu (Tencent) n-gram language models are widely used in language processing applications, e.g., automatic speech recognition, for ranking the candidate word sequences generated from the generator model, e.g., the acoustic model. Large n-gram models typically give good ranking results; however, they require a huge amount of memory storage. While distributing the model across multiple nodes resolves the memory issue, it nonetheless incurs a great network communication overhead and introduces a different bottleneck. In this paper, we present our distributed system developed at Tencent with novel optimization techniques for reducing the network overhead, including distributed indexing, batching and caching. They reduce the network requests and accelerate the operation on each single node. We also propose a cascade fault-tolerance mechanism which adaptively switches to small n-gram models depending on the severity of the failure. Experimental study on 9 automatic speech recognition (ASR) datasets confirms that our distributed system scales to large models efficiently, effectively and robustly. We have successfully deployed it for Tencent's WeChat ASR with the peak network traffic at the scale of 100 millions of messages per minute.
AnalyticDB: Real-time OLAP Database System at Alibaba Cloud Chaoqun Zhan (Alibaba Group), Maomeng Su (Alibaba Group), Chuangxian Wei (Alibaba Group), (Alibaba Group), Liang Lin (Alibaba Group), Sheng Wang (Alibaba Group), Zhe Chen (Alibaba Group), Feifei Li (Alibaba Group), Yue Pan (Alibaba Group), Fang Zheng (Alibaba Group), and Chengliang Chai (Alibaba Group) With data explosion in scale and variety, OLAP databases play an increasingly important role in serving real-time analysis with low latency (e.g., hundreds of milliseconds), especially when incoming queries are complex and ad hoc in nature. Moreover, these systems are expected to provide high query concurrency and write throughput, and support queries over structured and complex data types (e.g., JSON, vector and texts). In this paper, we introduce AnalyticDB, a real-time OLAP database system developed at Alibaba. AnalyticDB maintains all-column indexes in an asynchronous manner with acceptable overhead, which provides low latency for complex ad-hoc queries. Its storage engine extends hybrid row-column layout for fast retrieval of both structured data and data of complex types. To handle large-scale data with high query concurrency and write throughput, AnalyticDB decouples read and write access paths. To further reduce query latency, novel storage-aware SQL optimizer and execution engine are developed to fully utilize the advantages of the underlying storage and indexes. AnalyticDB has been successfully deployed on Alibaba Cloud to serve numerous customers (both large and small). It is capable of holding 100 trillion rows of records, i.e., 10PB+ in size. At the same time, it is able to serve 10m+ writes and 100k+ queries per second, while completing complex queries within hundreds of milliseconds.
Constant Time Recovery in Azure SQL Database Panagiotis Antonopoulos (Microsoft), Peter Byrne (Microsoft), Wayne Chen (Microsoft), Cristian Diaconu (Microsoft), Raghavendra Thallam Kodandaramaih (Microsoft), Hanuma Kodavalla (Microsoft), Prashanth Purnananda (Microsoft), Adrian-Leonard Radu (Microsoft), Chaitanya Sreenivas Ravella (Microsoft), and Girish Mittur Venkataramanappa (Microsoft) Azure SQL Database and the upcoming release of SQL Server introduce a novel database recovery mechanism that combines traditional ARIES recovery with multi-version concurrency control to achieve database recovery in constant time, regardless of the size of user transactions. Additionally, our algorithm enables continuous transaction log truncation, even in the presence of long running transactions, thereby allowing large data modifications using only a small, constant amount of log space. These capabilities are particularly important for any Cloud database service given a) the constantly increasing database sizes, b) the frequent failures of commodity hardware, c) the strict availability requirements of modern, global applications and d) the fact that software upgrades and other maintenance tasks are managed by the Cloud platform, introducing unexpected failures for the users. This paper describes the design of our recovery algorithm and demonstrates how it allowed us to improve the availability of Azure SQL Database by guaranteeing consistent recovery times of under 3 minutes for 99.999% of recovery cases in production.
S3: A Scalable In-memory Skip-List Index for Key-Value Store Jingtian Zhang (Zhejiang University), Sai Wu (Zhejiang University), Zeyuan Tan (Zhejiang University), Gang Chen (Zhejiang University), Zhushi Cheng (Alibaba Group), Wei Cao (Alibaba Group), Yusong Gao (Alibaba Group), and Xiaojie Feng (Alibaba Group) Many new memory indexing structures have been proposed and outperform current in-memory skip-list index adopted by LevelDB, RocksDB and other key-value systems. However, those new indexes cannot be easily intergrated with key-value systems, because most of them do not consider how the data can be efficiently flushed to disk. Some assumptions, such as fixed size key and value, are unrealistic for real applications. In this paper, we present S3, a scalable in-memory skip-list index for the customized version of RocksDB in Alibaba Cloud. S3 adopts a two-layer structure. In the top layer, a cache-sensitive structure is used to maintain a few guard entries to facilitate the search over the skip-list. In the bottom layer, a semi-ordered skip-list index is built to support highly concurrent insertions and fast lookup and range query. To further improve the performance, we train a neural model to select guard entries intelligently according to the data distribution and query distribution. Experiments on multiple datasets show that S3 achieves a comparable performance to other new memory indexing schemes, and can replace current in-memory skip-list of LevelDB and RocksDB to support huge volume of data.
Native Store Extension for SAP HANA Reza Sherkat (SAP SE), Colin Florendo (SAP SE), Mihnea Andrei (SAP SE), Rolando Blanco (SAP SE), Adrian Dragusanu (SAP SE), Amit Pathak (SAP SE), Pushkar Khadilkar (SAP SE), Neeraj Kulkarni (SAP SE), Christian Lemke (SAP SE), Sebastian Seifert (SAP SE), Sarika Iyer (SAP SE), Sasikanth Gottapu (SAP SE), Robert Schulze (SAP SE), Chaitanya Gottipati (SAP SE), Nirvik Basak (SAP SE), Yanhong Wang (SAP SE), Vivek Kandiyanallur (SAP SE), Santosh Pendap (SAP SE), Dheren Gala (SAP SE), Rajesh Almeida (SAP SE), and Prasanta Ghosh (SAP SE) We present an overview of SAP HANA's Native Store Extension (NSE). This extension substantially increases database capacity, allowing to scale far beyond available system memory. NSE is based on a hybrid in-memory and paged column store architecture composed from data access primitives. These primitives enable the processing of hybrid columns using the same algorithms optimized for traditional HANA's in-memory columns. Using only three key primitives, we fabricated byte-compatible counterparts for complex memory resident data structures (e.g. dictionary and hash-index), compressed schemes (e.g. sparse and run-length encoding), and exotic data types (e.g. geo-spatial). We developed a new buffer cache which optimizes the management of paged resources by smart strategies sensitive to page type and access patterns. The buffer cache integrates with HANA's new execution engine that issues pipelined prefetch requests to improve disk access patterns. A novel load unit configuration, along with a unified persistence format, allows the hybrid column store to dynamically switch between in-memory and paged data access to balance performance and storage economy according to application demands while reducing Total Cost of Ownership (TCO). A new partitioning scheme supports load unit specification at table, partition, and column level. Finally, a new advisor recommends optimal load unit configurations. Our experiments illustrate the performance and memory footprint improvements on typical customer scenarios.
Choosing A Cloud DBMS: Architectures and Tradeoffs Junjay Tan (Brown University), Thanaa Ghanem (Metropolitan State University (Minnesota)), Matthew Perron (MIT CSAIL), Xiangyao Yu (MIT CSAIL), Michael Stonebraker (MIT CSAIL, Tamr, Inc.), David DeWitt (MIT CSAIL), Marco Serafini (University of Massachusetts Amherst), Ashraf Aboulnaga (Qatar Computing Research Institute, HBKU), and Tim Kraska (MIT CSAIL) As analytic (OLAP) applications move to the cloud, DBMSs have shifted from employing a pure shared-nothing design with locally attached storage to a hybrid design that combines the use of shared-storage (e.g., AWS S3) with the use of shared-nothing query execution mechanisms. This paper sheds light on the resulting tradeoffs, which have not been properly identified in previous work. To this end, it evaluates the TPC-H benchmark across a variety of DBMS offerings running in a cloud environment (AWS) on fast 10Gb+ networks, specifically database-as-a-service offerings (Redshift, Athena), query engines (Presto, Hive), and a traditional cloud agnostic OLAP database (Vertica). While these comparisons cannot be apples-to-apples in all cases due to cloud configuration restrictions, we nonetheless identify patterns and design choices that are advantageous. These include prioritizing low-cost object stores like S3 for data storage, using system agnostic yet still performant columnar formats like ORC that allow easy switching to other systems for different workloads, and making features that benefit subsequent runs like query precompilation and caching remote data to faster storage optional rather than required because they disadvantage ad hoc queries.
Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB Asya Kamsky (MongoDB Inc) MongoDB is a popular distributed database that supports replication, horizontal partitioning (sharding), a flexible document schema and ACID guarantees on the document level. While it is generally grouped with "NoSQL" databases, MongoDB provides many features similar to those of traditional RDBMS such as secondary indexes, an ad hoc query language, support for complex aggregations, and new as of version 4.0 multi-statement, multi-document ACID transactions. We looked for a well understood OLTP workload benchmark to use in our own system performance test suite to establish a baseline of transaction performance to enable flagging performance regressions, as well as improvements as we continue to add new functionality. While there exist many published and widely used benchmarks for RDBMS OLTP workloads, there are none specifically for document databases. This paper describes the process of adapting an existing traditional RDBMS benchmark to MongoDB query language and transaction semantics to allow measuring transaction performance. We chose to adapt the TPC-C benchmark even though it assumes a relational database schema and SQL, hence extensive changes had to be made to stay consistent with MongoDB best practices. Our goal did not include creating official TPC-C certifiable results, however, every attempt was made to stay consistent with the spirit of the original benchmark specification as well as to be compliant to all specification requirements where possible. We discovered that following best practices for document schema design achieves better performance than using required normalized schema. All the source code used and validation scripts are published in github to allow the reader to recreate and verify our results.
A Morsel-Driven Query Execution Engine for Heterogeneous Multi-Cores Kayhan Dursun (Brown University), Carsten Binnig (TU Darmstadt), Ugur Cetintemel (Brown University), Garret Swart (Oracle Corporation), and Weiwei Gong (Oracle Corporation) Currently, we face the next major shift in processor designs that arose from the physical limitations known as the "dark silicon effect". Due to thermal limitations and shrinking transistor sizes, multi-core scaling is coming to an end. A major new direction that hardware vendors are currently investigating involves specialized and energy-efficient hardware accelerators (e.g., ASICs) placed on the same die as the normal CPU cores. In this paper, we present a novel query processing engine called SiliconDB that targets such heterogeneous processor environments. We leverage the Sparc M7 platform to develop and test our ideas. Based on the SSB benchmarks, as well as other micro benchmarks, we compare the efficiency of SiliconDB with existing execution strategies that make use of co-processors (e.g., FPGAs, GPUs) and demonstrate speed-up improvements of up to 2x.
DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees Charles Masson (Datadog), Jee E. Rim (Datadog), and Homin K. Lee (Datadog) Summary statistics such as the mean and variance are easily maintained for large, distributed data streams, but order statistics (i.e., sample quantiles) can only be approximately summarized. There is extensive literature on maintaining quantile sketches where the emphasis has been on bounding the rank error of the sketch while using little memory. Unfortunately, rank error guarantees do not preclude arbitrarily large relative errors, and this often occurs in practice when the data is heavily skewed. Given the distributed nature of contemporary large-scale systems, another crucial property for quantile sketches is mergeablility, i.e., several combined sketches must be as accurate as a single sketch of the same data. We present the first fully-mergeable, relative-error quantile sketching algorithm with formal guarantees. The sketch is extremely fast and accurate, and is currently being used by Datadog at a wide-scale.
Experiences with Approximating Queries in Microsoft's Production Big-Data Clusters Srikanth Kandula (Microsoft), Kukjin Lee (Microsoft), Surajit Chaudhuri (Microsoft), and Marc Friedman (Microsoft) With the rapidly growing volume of data, it is more attractive than ever to leverage approximations to answer analytic queries. Sampling is a powerful technique which has been studied extensively from the point of view of facilitating approximation. Yet, there has been no large-scale study of effectiveness of sampling techniques in big data systems. In this paper, we describe an in-depth study of the sampling-based approximation techniques that we have deployed in Microsoft's big data clusters. We explain the choices we made to implement approximation, identify the usage cases, and study detailed data that sheds insight on the usefulness of doing sampling based approximation.
SAP HANA goes private - From Privacy Research to Privacy Aware Enterprise Analytics Stephan Kessler (SAP SE), Jens Hoff (SAP SE), and Johann-Christoph Freytag (Humboldt-Universität zu Berlin) Over the last 20 years, the progress of information technology has allowed many companies to generate, integrate, store, and analyze data of unprecedented size and complexity. In many cases, this data is personal data and how it can be used is therefore subject to laws that depend on the specific countries and application domains. For example, the General Data Protection Regulation (GDPR) introduced in the European Union imposes strict rules on how personal data can be processed. Analyzing personal data can create tremendous value, but at the same time companies must ensure that they remain legally compliant. Unfortunately, existing systems offer only limited or no support at all for processing personal data in a privacy-aware manner. Approaches that have emerged from the academic and industrial research environments need to be integrated into large systems (like enterprise systems) in a manageable and scalable way. In many IT environments, it is also desirable and necessary to combine and to integrate personal data with other (non-personal) data in a seamless fashion. In this paper, we present the first steps that SAP has taken to provide its database management system SAP HANA with privacy-enhanced processing capabilities, referred to in the following as SAP HANA Data Anonymization. Various goals on both the conceptual and technical levels were followed with the aim of providing SAP customers today with an integrated processing environment for personal and non-personal data.
Tunable Consistency in MongoDB William Schultz (MongoDB, Inc.), Tess Avitabile (MongoDB, Inc.), and Alyson Cabral (MongoDB, Inc.) Distributed databases offer high availability but can impose high costs on client applications in order to maintain strong consistency at all times. MongoDB is a document oriented database whose replication system provides a variety of consistency levels allowing client applications to select the trade-offs they want to make when it comes to consistency and latency, at a per operation level. In this paper we discuss the tunable consistency models in MongoDB replication and their utility for application developers. We discuss how the MongoDB replication system's speculative execution model and data rollback protocol help make this spectrum of consistency levels possible. We also present case studies of how these consistency levels are used in real world applications, along with a characterization of their performance benefits and trade-offs.
Customizable and Scalable Fuzzy Join for Big Data Zhimin Chen (Microsoft Research), Yue Wang (Microsoft Research), Vivek Narasayya (Microsoft Research), and Surajit Chaudhuri (Microsoft Research) Fuzzy join is an important primitive for data cleaning. The ability to customize fuzzy join is crucial to allow applications to address domain-specific data quality issues such as synonyms and abbreviations. While efficient indexing techniques exist for single-node implementations of customizable fuzzy join, the state-of-the-art scale-out techniques do not support customization, and exhibit poor performance and scalability characteristics. We describe the design of a scale-out fuzzy join operator that supports customization. We use a locality-sensitive-hashing (LSH) based signature scheme, and introduce optimizations that result in significant speed up with negligible impact on recall. We evaluate our implementation on the Azure Databricks version of Spark using several real-world and synthetic data sets. We observe speedups exceeding 50X compared to the best-known prior scale-out technique, and close to linear scalability with data size and number of nodes.