Eventual Consistency

    An introduction to eventual consistency and what it means in terms of handling data with Riak.

    In a distributed and fault-tolerant environment like Riak, the unexpected is expected. Nodes may leave and join the cluster at any time, whether by accident (node failure, network partition, etc.) or on purpose, e.g. when a node is explicitly removed from a Riak cluster. Even with one or more nodes down, the cluster is still expected to accept writes and serve reads, and the system is expected to eventually return the same data from all nodes, including failed ones after they rejoin the cluster.

    A simple example of eventual consistency

    The basis for a simple example is a Riak cluster with five nodes and the default replication factor (N) of 3. That means every piece of data is stored three times in this cluster. In this setup, reads use a quorum of 2 to ensure at least two copies agree, and writes also use a quorum of 2, so that at least two replicas must acknowledge every write (with R + W > N, the read and write quorums overlap).

    When data is written with a quorum of 2, Riak still sends the write request to all three replicas, but returns a successful reply as soon as two of them have acknowledged a successful write on their end. One node may even be down and never acknowledge the write.
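
    To make this concrete, here's a minimal sketch of such a write and read using the official Python client; the node address and the bucket and key names are assumptions for the example:

        import riak

        # Connect to any node in the cluster (address is an assumption).
        client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
        bucket = client.bucket('users')
        bucket.set_property('n_val', 3)   # three replicas per key (the default)

        # Write with a quorum of 2: the request goes to all three replicas,
        # and succeeds as soon as two of them acknowledge it.
        obj = bucket.new('alice', data={'email': 'alice@example.org'})
        obj.store(w=2)

        # Read with a quorum of 2: succeeds once two replicas have answered.
        fetched = bucket.get('alice', r=2)
        print(fetched.data)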

    The purpose of the examples below is to describe how a replica node that has just failed gets its data back, so that at some undetermined point in the future, all replicas of a piece of data are up-to-date. There are ways of controlling when a specific piece of data is consistent across all its replicas, but either way, the data will eventually be consistent on all of them.

    Most of these scenarios are easy enough to reproduce. Usually you just need to determine the preference list for a key on a healthy cluster, take down one or more nodes, and try a combination of read and write requests to see what happens in each case. It's also good to keep an eye on the logs to find out when hinted handoffs occur.

    Before we dive into the failure scenarios, let's examine how a request in Riak is handled and spread across the replicas.

    Anatomy of a Riak Request

    To understand how eventual consistency is handled in a Riak environment, it's important to know how a request is handled internally. There's not much magic involved, but it helps to understand things like read repair and how the quorum is handled in read and write requests. Be sure to read the wiki page on Replication first: it has all the details on how data is replicated across nodes and what happens when a node becomes unavailable for read and write requests, which is the basis for how eventual consistency is handled in Riak.

    Recall that every key belongs to N primary virtual nodes (vnodes), which run on the physical nodes assigned to them in the ring. Secondary virtual nodes run on nodes that are close to the primaries in the key space and stand in for primaries when they are unavailable (they are also called fallbacks).
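
    To make that idea concrete, here is a deliberately simplified toy sketch (not Riak's actual implementation) of how a key hashes to N primary vnodes on the ring, and how vnodes on healthy nodes further along the ring stand in as fallbacks when a primary's node is down:

        import hashlib

        NUM_PARTITIONS = 64   # toy ring size; Riak's default ring also has 64 partitions
        N = 3                 # number of replicas per key

        def key_partition(key):
            # Riak hashes the {bucket, key} pair with SHA-1 onto a 160-bit ring;
            # this toy version just reduces the hash modulo the partition count.
            digest = hashlib.sha1(key.encode('utf-8')).digest()
            return int.from_bytes(digest, 'big') % NUM_PARTITIONS

        def preference_list(key, nodes, down=frozenset()):
            # Walk the ring clockwise from the key's partition and collect the
            # first N vnodes whose nodes are up; vnodes skipped over because
            # their node is down are replaced by fallbacks further along.
            start = key_partition(key)
            preflist = []
            for step in range(NUM_PARTITIONS):
                idx = (start + step) % NUM_PARTITIONS
                node = nodes[idx % len(nodes)]  # toy partition-to-node assignment
                if node in down:
                    continue
                preflist.append((idx, node))
                if len(preflist) == N:
                    break
            return preflist

        nodes = ['node1', 'node2', 'node3', 'node4', 'node5']
        print(preference_list('alice', nodes))                  # three primary vnodes
        print(preference_list('alice', nodes, down={'node2'}))  # a fallback steps in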

    The basic steps of a request in Riak are the following:

    1. A client sends a read or write request for a key to a node in the cluster.
    2. The receiving node coordinates the request: it hashes the bucket/key pair to determine the key's preference list, i.e. the N vnodes responsible for that key.
    3. The coordinator forwards the request to the vnodes in the preference list.
    4. As soon as enough vnodes have replied to satisfy the quorum (R for reads, W for writes), the coordinator returns a response to the client.

    The steps are similar for both read and write requests, though some details differ; we'll go into the differences in the examples below.

    In our example cluster, we'll assume that it's healthy and all nodes are available, which means requests are sent to all three primary replicas of the requested key.

    Failure Scenarios

    Now we'll go through a bunch of failure scenarios that could result in data inconsistencies, and we'll explore how they're resolved. Every scenario assumes a healthy cluster to begin with, where all nodes are available.

    In a typical failure scenario, at least one node goes down, leaving two replicas intact in the cluster. Clients can expect that reads with an R of 2 will still succeed until the third replica comes back up again. It's up to the application to implement some sort of graceful degradation, whether automated or as a feature flag that can be toggled at runtime, or to simply retry when a piece of data is expected to exist but a first request returns not found.
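
    What such a degradation could look like, sketched with the Python client; the feature flag is hypothetical, and we assume here that an unmet read quorum surfaces as a riak.RiakError:

        import riak

        client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
        bucket = client.bucket('users')

        # Hypothetical feature flag, e.g. flipped at runtime by an operator
        # while part of the cluster is known to be down.
        DEGRADED_READS = False

        def fetch(key):
            try:
                # Normal operation: require two replicas to answer.
                return bucket.get(key, r=2)
            except riak.RiakError:
                if DEGRADED_READS:
                    # Degraded mode: accept an answer from a single replica,
                    # trading consistency guarantees for availability.
                    return bucket.get(key, r=1)
                raise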

    Reading When One Primary Fails

    With one primary down, a fallback vnode takes the failed node's place in the preference list, but it does not yet hold any data for the key. A read with R=3 will therefore fail at first, because only two replicas can return the object. The failed read still triggers read repair, which populates the fallback, so a subsequent read succeeds. Note that if we had requested with R=2 or less, the first request would have succeeded, because two replicas were available all along.

    Reading When Two Primaries Fail

    This is similar to the scenario above, but read consistency may initially degrade even further, as only one replica is available at first.

    Reading When Three Primaries Fail

    This scenario will always yield a not found error, as none of the nodes answering the request holds the data. Read repair will not occur, because no replicas of the data can be found.

    Writing And Reading When One Primary Failed

    With one primary down, a write with W=2 still succeeds: the two healthy primaries store the data, and a fallback vnode accepts the write on behalf of the failed primary, along with a hint recording where the data actually belongs. A subsequent read with R=2 can be served by the two healthy primaries.

    Writing And Reading When One Primary Failed and Later Recovered

    Once the failed primary rejoins the cluster, hinted handoff kicks in: the fallback vnode hands the data it accepted during the outage over to the recovered primary and then discards its copy. From that point on, all three primaries hold the object again, and reads behave as they would on a healthy cluster.

    Edge Case: Writing And Reading During Handoff

    This is an extension of the previous scenario. The primary has recovered from failure and is again available to the cluster. As hinted handoff kicks in, the node has already reclaimed its position in the preference list and will serve requests for the key. Should the recovered node be hit with requests during handoff, it's likely to return not_found for data that was written to a fallback during its unavailability.

    In situations of limited degradation, this will not be an issue, because the other two primaries will have the new data available; however, in cases where multiple nodes have failed, you may experience a period of increased not found responses as the primaries catch up. In these cases, we encourage retrying the request a limited number of times until it succeeds.

    Writing And Reading When Two Primaries Failed and One Recovers

    This is similar to the failure of a single primary. When one primary recovers, either hinted handoff or read repair kicks in to ensure that a subsequent read with the default quorum can be served successfully. The same happens when the second failed primary comes back up again.

    Read Your Own Writes

    It is desirable to always be able to read your own writes, even in a distributed database environment like Riak. When you write to a particular node with a W value that equals the N value, the number of replicas configured for your cluster, you can assume that a subsequent read will return the value you just wrote, provided that no node in the cluster failed at just that instant.
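
    A minimal sketch of that pattern with the Python client, assuming the bucket keeps the default N of 3 (bucket and key names are examples):

        import riak

        client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
        bucket = client.bucket('sessions')

        # Require all N replicas to acknowledge the write (W = N = 3).
        obj = bucket.new('session-42', data={'user': 'alice'})
        obj.store(w=3)

        # With W = N, and barring a node failing at exactly this instant,
        # a read from any replica returns the value we just wrote.
        assert bucket.get('session-42', r=1).data == {'user': 'alice'}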

    However, node failure can occur at any point in time, so your application should be prepared to retry reads for objects that it expects to exist, both for reads on the same node it just sent a write to and for reads from other nodes in the cluster.

    There is no guarantee that a write will make it to all replicas before another client asks a different node for the same object. A write may still be in flight, waiting for confirmation from even just one vnode, while another client already tries to read from a different or even the same physical node. Clients are unaware of partition and replica placement in your cluster, so they'll have to work around the potential issues that can occur. Even latency in the millisecond range can lead to unpredictable results in these scenarios.

    In your application, if you rely on and expect objects you read to exist at any time, be prepared to retry a number of times when your code receives a not_found from Riak. Build in some form of exponential backoff, and eventually fail or give up in your application's code when you can safely assume the value is indeed nonexistent, returning an older value or simply no value at all.
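
    A minimal sketch of such a retry loop; the attempt count and delays are arbitrary examples:

        import time
        import riak

        client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
        bucket = client.bucket('users')

        def fetch_with_retries(key, attempts=5, base_delay=0.05):
            # Retry not_found reads with exponential backoff, then give up
            # and let the caller decide what a truly missing value means.
            for attempt in range(attempts):
                obj = bucket.get(key, r=2)
                if obj.exists:
                    return obj
                time.sleep(base_delay * (2 ** attempt))  # 50 ms, 100 ms, 200 ms, ...
            return None

        result = fetch_with_retries('alice')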

    Further Reading