ROUGH DRAFT authorea.com/17161

# Claret: Using Data Types for Highly Concurrent Distributed Transactions (Preprint:PaPoC)

Abstract

Out of the many NoSQL databases in use today, some that provide simple data structures for records, such as Redis and MongoDB, are now becoming popular. Building applications using these data structures rather than plain string values provides programmers with a way to communicate intent to the database system without sacrificing flexibility or committing to a fixed schema. Currently, this expressiveness is used to ensure related values are co-located so they can be quickly accessed together (e.g. maps and other aggregates) and to provide complex atomic operations (e.g. set insertion). However, there are many more ways in which data types can be used to make databases more efficient and simpler to use that are not yet being exploited.

In this work, we demonstrate several ways of leveraging data structure semantics in databases, focusing primarily on commutativity. Reasoning about operation reordering can allow transactions to execute concurrently that would conflict under traditional concurrency control. Using Retwis, a Twitter clone built for Redis, as a case study, we show that using commutativity can reduce transaction abort rates for high-contention, update-heavy workloads that arise in real social networks. We conclude that data types are a good abstraction for database records, providing a safe and expressive programming model with ample opportunities for optimization, which will make databases more safe and scalable.

# Introduction

The move to non-relational (NoSQL) databases was motivated by a desire for scalability and flexibility. People found that by not enforcing strong consistency, they could better scale services to millions or billions of users while meeting tight performance goals. By forgoing the structure of relational schemas, they gained more direct control over performance at the cost of significant automatic optimization. The observation has been that relaxed or eventual consistency is a good fit for many applications built on top of these NoSQL because potential conflicts are difficult to observe in practice and that users are likely to accept minor inconsistencies, such as two tweets being out of temporal order on their Twitter timeline. However, this leaves much to chance; there is likely no guarantee that more significant inconsistencies are impossible.

In situations where consistency really is crucial, developers can enforce stronger guarantees on top of any platform, but they often get it wrong. Google, recognizing the benefit to programmer productivity, reintroduced serializable transactions to their next generation database, Spanner (Corbett 2012). However, this leaves them with two extreme choices, with significant performance impact. If certain parts of an application can tolerate imprecision, then why not capture those properties in the database programming model? What if the programmer could express the semantics they desire succintly and precisely, in a way that the database can better reason about conflicts and optimize performance, without giving up the flexibility and scalability of their existing systems?

Abstract data types (ADTs) are the solution to this problem. Rather than limiting the records in databases to being primitive types like strings or integers, raising them to more complex data types provides a richer interface, exposing ample opportunities for optimization to the database and a precise mechanism to express the intentions of programmers.

Performance benefits come from understanding the properties of ADT operations: those that commute with each other can be performed concurrently, even on multiple copies of the record. This means that transactions whose operations commute abort less, approaching the performance without transactions. This cleanly captures the cases described above where conflicts were unobservable or ordering didn't matter but in a safe way because any operations that don't in fact commute will be handled with traditional concurrency control. Using insights from multi-core concurrent data structures, we show in Section \ref{sec:commutativity} that it is practical to reason about the matrix of commutativity among operations and build implementations that make the right tradeoff between the amount of concurrency allowed and the efficiency of tracking it.

Selecting the right data type for the job gives programmers a clean, precise way to express their desired behavior. For instance, rather than using a generic integer to generate unique identifiers, a UUID type, realizing that contiguity of ids isn't necessary, can be trivially parallelized and distributed. Though this is an extremely simple case (and nearly universally adopted optimization), it fits the same mold as more nuanced decisions, such as choosing to represent the number of times a given post has been "retweeted" as a HyperLogLog, which can efficiently yield the approximate number of unique users, rather than a costly precise Set. Though selecting data structures for the job at hand is nothing new to programmers, only a handful of databases, such as Redis, MongoDB or Riak, support this flexibility, and they do not use the abstraction it affords to enforce strongly consistent transactions.

method: commutes with: when:
add(x): void add(y) $$\forall x, y$$
remove(x): void remove(y) $$\forall x, y$$