Since its inception in the 1980s, distributed consensus has been the subject of extensive academic research. Whilst definitions vary, distributed consensus (or equivalently, atomic broadcast) most often refers to the problem of how a set of distributed nodes can decide an ordered sequence of values. This can be used to implement an append-only replicated log, which can be utilized, directly or indirectly, to provide services such as primary-backup replication or state machine replication. These abstractions can, in turn, form the building blocks of new abstractions, such as a distributed key-value store. Some consensus algorithms instead decide only a single value or a partially ordered sequence of values.
What unifies distributed consensus algorithms is that they are always safe, regardless of delays and crashes (though they are not necessarily Byzantine fault tolerant), and are guaranteed to make progress provided sufficient liveness conditions hold.
This is a long list of papers relating to distributed consensus. Many of the papers listed below fit into more than one section. However, for simplicity, each paper is listed only in the most relevant section. Where possible, open access links for each paper have been provided. Contributions via pull requests are welcome.
⭐️ Influential papers - If you are looking for a starting point, a subset of the most influential papers on distributed consensus are highlighted using a yellow star. ⭐️
💎 Hidden gems - Papers which I personally love but are not as highly cited as the influential papers 💎
SoK: A Generalized Multi-Leader State Machine Replication Tutorial, JSys 2021 [pdf]
Algorithms for consensus
This section lists papers describing algorithms for distributed consensus. These papers tend to be theory papers (venues such as PODC, DISC, OPODIS) whereas the Implementations of consensus section focuses on systems papers.
Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols, PODC 1983 [acmdl]
Also known as the Ben-Or algorithm
Reliable communication in the presence of failures, TOCS 1987 [acmdl,pdf]
⭐️ Consensus in the Presence of Partial Synchrony, JACM 1988 [acmdl,pdf]
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC 1988 [acmdl,pdf]
This paper describes how to replace acceptors in Paxos with disks
Each disk is divided into blocks, one for each proposer. Each proposer may write only to its own block and read from the other blocks, which it does in each of the two usual Paxos phases
Each block contains the rough equivalent of the last promised ballot number and last accepted proposal for the assigned proposer
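The per-proposer block and the write-then-read phase step described above might be sketched as follows (a minimal sketch with hypothetical names, not the paper's actual data layout):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class DiskBlock:
    """One proposer's block on one disk (hypothetical field names)."""
    promised_ballot: int = 0               # ~ last promised ballot number
    accepted_ballot: int = 0               # ballot of the last accepted proposal
    accepted_value: Optional[Any] = None   # value of the last accepted proposal

def make_disk(proposer_ids):
    """A disk holds one block per proposer."""
    return {p: DiskBlock() for p in proposer_ids}

def phase_step(disk, proposer_id, block):
    """In each of the two Paxos phases, a proposer writes its own block
    and then reads every block on the disk."""
    disk[proposer_id] = block
    return dict(disk)  # snapshot of all blocks read back
```
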
Specifying and Using a Partitionable Group Communication Service, TOCS 2001 [acmdl,pdf]
Active Disk Paxos with infinitely many processes, PODC 2002 [acmdl,pdf]
This paper makes Disk Paxos more “Paxos like” by assuming the disks support richer operations, e.g. conditional writes
ADP claims that Disk Paxos requires a fixed set of proposers, a limitation which ADP removes.
Variant of Paxos where proposers can bypass the leader, allowing multiple values to be proposed in the same ballot. This requires stronger quorum intersection, e.g. Fast Paxos needs 3/4 of acceptors (instead of a simple majority) to provide the same liveness guarantees as classic Paxos.
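The quorum-size trade-off above can be checked with a quick calculation (a sketch using the common ⌈3n/4⌉ bound for fast quorums; a classic quorum must intersect the intersection of any two fast quorums, i.e. q + 2·fq ≥ 2n + 1):

```python
import math

def classic_quorum(n: int) -> int:
    """Classic Paxos: any two quorums must intersect."""
    return n // 2 + 1

def fast_quorum(n: int) -> int:
    """Fast Paxos: any classic quorum must intersect the
    intersection of any two fast quorums."""
    return math.ceil(3 * n / 4)

# The intersection condition q + 2*fq >= 2n + 1 holds for all n:
for n in range(1, 100):
    assert classic_quorum(n) + 2 * fast_quorum(n) >= 2 * n + 1
```

For five acceptors this gives a classic quorum of 3 but a fast quorum of 4, which is the source of the reduced fault tolerance on the fast path.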
Consensus on Transaction Commit, TODS 2006 [acmdl,pdf]
Variant of Paxos which replaces the single leader with a group of leaders. Clients send operations to all leaders, and each leader proposes values to the acceptors. Acceptors only accept a value once they have received proposals for it from a quorum of leaders. This is similar to the non-equivocation phase in BFT. Liveness no longer depends on a single leader.
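The accept rule described above might be sketched as follows (a hypothetical helper, not the paper's pseudocode; it assumes a simple-majority quorum of leaders):

```python
from collections import defaultdict

def make_acceptor(leader_ids):
    """Sketch: an acceptor that accepts a value for a slot only once a
    quorum of leaders has proposed that same value."""
    needed = len(leader_ids) // 2 + 1
    proposals = defaultdict(set)  # value -> set of leaders that proposed it

    def on_propose(leader_id, value):
        proposals[value].add(leader_id)
        # Accept only after a leader quorum has proposed this value.
        return value if len(proposals[value]) >= needed else None

    return on_propose
```
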
On Collision-fast Atomic Broadcast, AINA 2014 [pdf]
Paxos Quorum Leases: Fast Reads Without Sacrificing Writes, SOCC 2014 [acmdl,pdf]
Extends the idea of master read leases to allow the master to promise to use a specified subset of acceptors in every majority quorum. Acceptors in this quorum can then serve reads locally.
Like master read leases, this relies on clock synchrony.
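The local-read check at a lease-holding acceptor might look like the following (a minimal sketch with hypothetical names; real quorum leases also handle granting, renewal, and revocation):

```python
import time

class QuorumLease:
    """Sketch of a quorum lease held by an acceptor. While the lease is
    valid, the master promises to include this acceptor in every majority
    (write) quorum, so its local state is current and reads are safe."""

    def __init__(self, duration_s: float, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + duration_s

    def can_serve_local_read(self) -> bool:
        # Safety relies on bounded clock drift between master and holder.
        return self.clock() < self.expires_at
```
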
HovercRaft: Achieving Scalability and Fault-tolerance for microsecond-scale Datacenter Services, Eurosys 2020 [acmdl]
FLAIR: Accelerating Reads with Consistency-Aware Network Routing, NSDI 2020 [acmdl,pdf]
Microsecond Consensus for Microsecond Applications, OSDI 2020 [arxiv]
High availability in cheap distributed key value storage, SoCC 2020 [acmdl]
Odyssey: The Impact of Modern Hardware on Strongly-Consistent Replication Protocols, Eurosys 2021 [acmdl, pdf,techreport,thesis]
Consensus for geo-distributed systems
This section covers papers describing consensus algorithms for WANs and/or geo-replicated systems. Many of these algorithms (such as EPaxos) are leaderless and decide a partial-ordering over values instead of the more traditional total-ordering approach.
Mencius: Building Efficient Replicated State Machines for WANs, OSDI 2008 [acmdl,pdf]
Scalable Consistency in Scatter, SOSP 2011 [acmdl,pdf]
MDCC: Multi-Data Center Consistency, Eurosys 2013 [acmdl,pdf]
There Is More Consensus in Egalitarian Parliaments, SOSP 2013 [acmdl,pdf]
This paper describes EPaxos, which realizes Generalized Paxos and makes some further improvements (e.g. reducing the size of fast quorums by 1).
Geo-replicated storage with scalable deferred update replication, DSN 2013 [acmdl,pdf]
Low-Latency Multi-Datacenter Databases using Replicated Commit, VLDB 2013 [acmdl,pdf]
Be General and Don’t Give Up Consistency in Geo-Replicated Transactional Systems, OPODIS 2014 [pdf]
CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems, FAST 2015 [acmdl,pdf]
GlobalFS: A Strongly Consistent Multi-Site File System, SRDS 2016 [pdf]
Canopus: A Scalable and Massively Parallel Consensus Protocol, CoNEXT 2017 [acmdl,pdf]
Multileader WAN Paxos: Ruling the Archipelago with Fast Consensus, Tech report 2017 [pdf]
WPaxos: Wide Area Network Flexible Consensus, Unpublished 2017 [pdf]
Windows Azure Storage: a highly available cloud storage service with strong consistency, SOSP 2011 [acmdl,pdf]
Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011 [pdf]
Megastore uses SMR with witnesses (replicas that participate in log replication but do not run a state machine) and read-only replicas (which only run a state machine). This paper seems to use an unusual definition of Multi-Paxos where each instance is distinct but the 1a/1b messages for slot i are piggybacked onto the 2a/2b messages for slot i-1.
Zab: High-performance broadcast for primary-backup systems, DSN 2011 [acmdl,pdf]
Widely utilized, Apache-licensed open source project written in Java: project website
Apache Kafka used ZooKeeper, as well as its own replication protocol, as described here. This is no longer true; Kafka has since moved to its own Raft-based protocol (KRaft).
Architecture is similar to Google's Chubby but, unlike Chubby, it is described in detail and is open source
Writes are linearizable, reads may be stale
Note: calling sync before a read doesn't make the read linearizable
Clients may have multiple outstanding requests; they will be handled FIFO
Uses primary-backup replication instead of state machine replication
Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols, HotCloud 2017 [pdf,acmdl,slides]
Bolt-On Global Consistency for the Cloud, SoCC 2018 [acmdl,pdf]
Stable and Consistent Membership at Scale with Rapid, ATC 2018 [pdf]
Uses Fast Paxos to decide on membership changes. Conflicts are rare because the proposed value is the output of a membership algorithm, so proposers usually propose the same value.
Blockchains and Distributed Databases: a Twin Study [arxiv]
Performance analysis of five consensus systems: three non-Byzantine algorithms (including etcd) and two Byzantine consensus algorithms
Scalable but Wasteful: Current State of Replication in the Cloud, HotStorage 2021 [acmdl]
Study of the efficiency (CPU utilization) of Multi-Paxos vs EPaxos, finding that EPaxos provides better throughput than Multi-Paxos at the cost of much worse efficiency.
State machine replication
This section lists papers about the application of consensus to State Machine Replication (SMR/RSMs) and Linearizability.
⭐️ Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial, CSUR 1990 [acmdl,pdf]
⭐️ Linearizability: A Correctness Condition for Concurrent Objects, TOPLAS 1990 [acmdl,pdf]
Implementing Linearizability at Large Scale and Low Latency, SOSP 2015 [acmdl,pdf]
Cheap and Available State Machine Replication, ATC 2016 [acmdl,pdf]
Fine-Grained Replicated State Machines for a Cluster Storage System, NSDI 2020 [pdf]
Reconfiguration
This section lists papers on reconfiguration & leader election.
The SMART Way to Migrate Replicated Stateful Services, EuroSys 2006 [acmdl,pdf]
💎 Vertical Paxos and Primary-Backup Replication, PODC 2009 [acmdl,pdf]
Reconfiguring a State Machine, SIGACT News 2010 [acmdl,pdf]
Dynamic Reconfiguration of Primary/Backup Clusters, ATC 2012 [acmdl,pdf]
An Analysis of Network-Partitioning Failures in Cloud Systems, OSDI 2018 [acmdl,pdf]
CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis, SOSP 2019 [acmdl]
The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure, SOSP 2019 [acmdl]
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning, OSDI 2020 [pdf]
Tolerating Slowdowns in Replicated State Machines using Copilots, OSDI 2020 [pdf]
Metastable Failures in Distributed Systems, HotOS 2021 [acmdl]
Immunizing Systems from Distant Failures by Limiting Lamport Exposure, HotNets 2021 [acmdl]
Cores That Don’t Count, HotOS 2021 [acmdl,pdf,talk]
Clocks
The liveness of distributed consensus depends on some degree of clock synchronization. The following section lists papers on the topic of clock synchronization.
IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, Standard 1588-2008 [ieee]
Globally Synchronized Time via Datacenter Networks, SIGCOMM 2016 [acmdl,pdf]
Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization, NSDI 2018 [acmdl,pdf]
Sundial: Fault-tolerant Clock Synchronization for Datacenters, OSDI 2020 [pdf]