Multi-Datacenter Replication Reference: Monitoring

Monitoring Riak’s realtime replication allows you to identify trends and to receive alerts during times when replication is halted or delayed. Issues or delays in replication can be caused by:

Sudden increases or spikes in write traffic
Network connectivity issues or outages
Errors experienced in Riak

Identification and trending of issues or delays in realtime replication is important for identifying a root cause, while alerting is important for addressing any SLA-impacting issues or delays. We recommend combining the two approaches below when monitoring Riak’s realtime replication:

Monitor Riak’s replication status output, from either riak-repl status or the HTTP /riak-repl/stats endpoint
Use canary (test) objects to test replication and establish trip times from source to sink clusters

Note on querying and time windows

Riak’s statistics are calculated over a sliding 60-second window. Each time you query the stats interface, each sliding statistic shown is a sum or histogram value calculated from the previous 60 seconds of data. Because of this, the stats interface should not be queried more than once per minute.

Statistics

The following questions can be answered through the monitoring and graphing of realtime replication statistics:

Is the realtime replication queue backed up?
Have any errors occurred on either the source or sink cluster?
Have any objects been dropped from the realtime queue?

Is the realtime replication queue backed up?

Identifying times when the realtime replication queue experiences increases in the number of pending objects can help identify problems with realtime replication or identify times when replication becomes overloaded due to increases in traffic. The pending statistic, found under the realtime_queue_stats section of the replication status output, should be monitored and graphed. Graphing this statistic allows you to identify trends in the number of pending objects. Any repeating or predictable trend in this statistic can be used to help identify a need for tuning and capacity changes, while unexpected variation in this statistic may indicate either sudden changes in load or errors at the network, system, or Riak level.

Have any errors occurred on either the source or sink cluster?

Errors experienced on either the source or sink cluster can result in failure to replicate object(s) via realtime replication. The top-level rt_dirty statistic in riak-repl status indicates whether such an error has occurred and how many times. This statistic only tracks errors and does not definitively indicate that an object was not successfully replicated. For this reason, a fullsync should be performed any time rt_dirty is non-zero. rt_dirty is then reset to zero once a fullsync successfully completes.

The size of rt_dirty can quantify the number of errors that have occurred and should be graphed. Since any non-zero value indicates an error, an alert should be set so that a fullsync can be performed (if not regularly scheduled). Like realtime queue back ups, trends in rt_dirty can reveal problems with the network, system, or Riak.

Have any objects been dropped from the realtime queue?

The realtime replication queue will drop objects when the queue is full, with the dropped object(s) being the last (oldest) in the queue. Each time an object is dropped, the drops statistic, which can be found under the realtime_queue_stats section of the replication status output, is incremented. An object dropped from the queue has not been replicated successfully, and a fullsync should be performed when a drop occurs. A dropped object can indicate a halt or delay in replication or indicate that the realtime queue is overloaded. In cases of high load, increases to the maximum size of the queue (displayed in the realtime_queue_stats section of the replication status output as max_bytes) can be made to accommodate a usage pattern of expected high load.

Although the above statistics have been highlighted to answer specific questions, other statistics can also be helpful in diagnosing issues with realtime replication. We recommend graphing any statistic that is reported as a number. While their values and trends may not answer common questions or those we’ve highlighted here, they may nonetheless be important when investigating issues in the future. Other questions that cannot be answered through statistics alone may be addressed through the use of canary objects.

Canary Objects

Canary object testing is a technique that uses a test object stored in your environment with your production data but not used or modified by your application. This allows the test object to have predictable states and to be used to answer questions about the functionality and duration of realtime replication.

The general process for using canary objects to test realtime replication is:

Perform a GET for your canary object on both your source and sink clusters, noting their states. The state of the object in each cluster can be referred to as state S0, or the object’s initial state.
PUT an update for your canary object to the source cluster, updating the state of the object to the next state, S1.
Perform a GET for your canary on the sink cluster, comparing the state of the object on the source cluster to the state of the object on the sink cluster.

By expanding upon the general process above, the following questions can be answered:

Is a backed-up realtime replication queue still replicating objects within a defined SLA?
How long is it taking for objects to be replicated from the source cluster to the sink cluster?

Is a backed-up realtime replication queue still replicating objects within a defined SLA?

Building on the final step of the general process, we can determine if our objects are being replicated from the source cluster to the sink cluster within a certain SLA time period by adding the following steps:

If the state of the object on the source cluster is not equal to the state of the object on the sink cluster, repeat step 3 until an SLA time threshold is exceeded.
If the SLA time threshold is exceeded, alert that replication is not meeting the necessary SLA.

How long is it taking for objects to be replicated from the source cluster to the sink cluster?

Getting a rough estimate of how long it takes an object PUT to a source cluster to be replicated to a sink cluster get be done by either:

Comparing the time the object was PUT to the source with the time the states of the object in the source and sink were equivalent
Comparing the timestamps of the object on the source and sink when the states are equivalent

These are rough estimates, as neither method is 100% accurate. The first method relies on a timestamp for a GET and subsequent successful comparison, which means that the object was replicated prior to that timestamp; the second method relies on the system clocks of two different machines, which may not be in sync.

It’s important to note that each node in a cluster has its own realtime replication queue. The general process needs to be applied to every node in the source cluster, with a variety of canary objects and states, to get a complete picture of realtime replication between two clusters.