Riak provides data related to current operating status, which includes
statistics in the form of counters and histograms. These statistics are
made available through the HTTP API via the /stats endpoint, or through
the riak-admin interface, in particular the stat and status commands.
This page presents the most commonly monitored and gathered statistics, as well as numerous solutions for monitoring and gathering statistics that our customers and community report using successfully in Riak cluster environments. You can learn more about the specific Riak statistics provided in the Inspecting a Node and HTTP Status documentation.
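As a minimal sketch of reading statistics over HTTP, the Python below fetches and decodes the JSON document served by a node. The URL is an assumption: it presumes the node's HTTP listener is reachable at 127.0.0.1:8098 and serves statistics at /stats, so adjust it to match your configuration. The helper names `parse_stats` and `fetch_stats` are illustrative, not part of any Riak client.

```python
import json
from urllib.request import urlopen

# Assumption: local node, HTTP listener on port 8098, stats served at /stats.
STATS_URL = "http://127.0.0.1:8098/stats"


def parse_stats(body: str) -> dict:
    """Decode the JSON body returned by the stats endpoint into a dict."""
    stats = json.loads(body)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object of statistics")
    return stats


def fetch_stats(url: str = STATS_URL, timeout: float = 5.0) -> dict:
    """Fetch the current statistics from a running node."""
    with urlopen(url, timeout=timeout) as response:
        return parse_stats(response.read().decode("utf-8"))


# Example usage against a running node:
#   stats = fetch_stats()
#   for name in sorted(stats):
#       print(f"{name} : {stats[name]}")
```

Keeping the parsing separate from the network call makes it easy to reuse the same code on a saved response when no node is reachable.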
The riak-admin tool provides two
interfaces for retrieving statistics and other information:
The riak-admin status command will return all of the currently
available information from a running node.
This will return a list of over 300 key/value pairs, like this:
    1-minute stats for 'firstname.lastname@example.org'
    -------------------------------------------
    connected_nodes : ['email@example.com','firstname.lastname@example.org']
    consistent_get_objsize_100 : 0
    consistent_get_objsize_195 : 0
    ... etc ...
A comprehensive list of available stats can be found in the Inspecting a Node document.
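Output in the `name : value` form shown above is straightforward to turn into a dictionary for further processing. The following is a rough sketch, assuming every statistic appears on its own line with ` : ` between name and value; the function name `parse_status_output` is hypothetical.

```python
def parse_status_output(text: str) -> dict:
    """Parse 'name : value' lines into a dict, skipping everything else."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition(" : ")
        if not separator:
            continue  # header, dashed rule, or other non-statistic line
        value = value.strip()
        try:
            stats[name.strip()] = int(value)  # plain integer statistics
        except ValueError:
            stats[name.strip()] = value  # lists, strings, etc. stay as text
    return stats


# Example usage on a node where riak-admin is on the PATH:
#   import subprocess
#   output = subprocess.run(["riak-admin", "status"], capture_output=True,
#                           text=True, check=True).stdout
#   stats = parse_status_output(output)
```

Non-integer values are deliberately left as strings, since their format varies from one statistic to another.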
The riak-admin stat command is related to the riak-admin status
command but provides a more fine-grained interface for interacting with
stats and information. Full documentation of this command can be found
in the Inspecting a Node document.
Riak provides counters for GETs, PUTs, read repairs, and other common operations. By default, each counter reports either the number of operations that occurred within the last minute or the total since the node started.
Gets and Puts
GET/PUT counters are provided for both nodes and vnodes. These counters are commonly graphed over time for trend analysis, capacity planning, and so forth.
At a minimum, the following stats should be graphed:
Read repair counters are commonly graphed and monitored for abnormally high totals, which can be indicative of an issue.
Counters representing the number of coordinated node redirection operations are provided in total since node startup.
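Counters that accumulate since node startup are usually graphed as rates rather than raw totals. The sketch below derives a per-second rate from two samples of such a counter. It is an illustration only: the reset handling assumes that a total lower than the previous sample means the node restarted in between, and `per_second_rate` is a hypothetical helper name.

```python
def per_second_rate(previous_total: int, current_total: int,
                    elapsed_seconds: float) -> float:
    """Convert two samples of a since-startup counter into a rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = current_total - previous_total
    if delta < 0:
        # Assumption: the counter went backwards because the node restarted,
        # so everything counted so far happened after the restart.
        delta = current_total
    return delta / elapsed_seconds
```

For example, two samples taken 60 seconds apart with totals of 1000 and 1600 give a rate of 10 operations per second.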
Riak provides histogram statistics for a range of operations. By default, Riak reports the mean, median, 95th percentile, 99th percentile, and 100th percentile (the maximum) over a 60-second window.
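To make these five summary values concrete, here is a small sketch that computes them from a list of samples collected during one window. It uses the nearest-rank percentile definition, which is an assumption made for illustration; Riak's own histogram implementation may compute percentiles differently.

```python
import math
import statistics


def nearest_rank(ordered, percent):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean, median, and 95th/99th/100th percentiles."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "95": nearest_rank(ordered, 95),
        "99": nearest_rank(ordered, 99),
        "100": ordered[-1],  # the 100th percentile is the maximum
    }
```

The high percentiles are often more useful than the mean for spotting problems, because a few very slow operations barely move the mean but show up clearly at the 99th and 100th percentiles.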
Finite State Machine Time
Riak exposes Finite State Machine (FSM) time counters
(node_get_fsm_time_* and node_put_fsm_time_*) that measure the amount
of time in microseconds required to traverse the GET or PUT FSM code,
offering a picture of general node health.
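Because FSM times are reported in microseconds, it is common to convert them to milliseconds before comparing them with a latency budget. The sketch below does that for a dictionary of statistics. The `node_get_fsm_time_` prefix, the sample stat names, and the threshold are assumptions chosen for illustration.

```python
def slow_fsm_stats(stats, threshold_ms,
                   prefixes=("node_get_fsm_time_", "node_put_fsm_time_")):
    """Return FSM time stats, in milliseconds, that exceed threshold_ms."""
    slow = {}
    for name, value in stats.items():
        if not name.startswith(prefixes):
            continue
        if not isinstance(value, (int, float)):
            continue  # skip undefined or non-numeric values
        milliseconds = value / 1000.0  # stats are reported in microseconds
        if milliseconds > threshold_ms:
            slow[name] = milliseconds
    return slow


# Hypothetical sample: only the PUT value exceeds a 50 ms budget.
sample = {"node_get_fsm_time_95": 2500, "node_put_fsm_time_95": 120000}
print(slow_fsm_stats(sample, threshold_ms=50))
```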
GET FSM Object Size
GET FSM Object Size (node_get_fsm_objsize_*) measures the size of
objects flowing through this node's GET finite state machine (GET_FSM).
The size of an object is obtained by summing the length of the object's
bucket name, key, serialized vector clock, value, and the serialized
metadata of each sibling.
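The size calculation described above can be sketched as follows. This is one reading of the description, assuming that the value and the serialized metadata are both counted once per sibling; the `Sibling` type and `object_size` function are hypothetical and do not mirror Riak's internal data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sibling:
    value: bytes
    metadata: bytes  # serialized metadata for this sibling


def object_size(bucket: bytes, key: bytes, vclock: bytes,
                siblings: List[Sibling]) -> int:
    """Sum the lengths of the parts that make up an object."""
    size = len(bucket) + len(key) + len(vclock)
    for sibling in siblings:
        size += len(sibling.value) + len(sibling.metadata)
    return size
```

Under this reading, an object with many siblings reports a proportionally larger size, which is why object size and sibling counts are often watched together.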
GET FSM Siblings
GET FSM Siblings (node_get_fsm_siblings_*) provides a histogram (with a
60-second window) of the number of siblings encountered by this node
when handling a GET request.
Additional Riak Metrics to Graph
Systems Metrics To Graph
- Available disk space
- Read operations
- Write operations
- Network throughput
- Load average
We also recommend tracking your system's virtual memory and writebacks. Massive flushes of dirty pages or steadily climbing writeback volumes can indicate poor virtual memory tuning. More information can be found in our documentation on system tuning.
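Several of these system metrics can be sampled with the Python standard library alone. The sketch below reads free disk space and load average, and extracts dirty-page and writeback figures from text in the format of Linux's /proc/meminfo. It assumes a Linux host; the `Dirty` and `Writeback` field names are specific to that file, and the function names are illustrative.

```python
import os
import shutil


def parse_meminfo(text, fields=("Dirty", "Writeback")):
    """Extract selected fields (reported in kB) from /proc/meminfo text."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if separator and name in fields:
            parts = rest.split()
            if parts:
                values[name] = int(parts[0])
    return values


def system_snapshot(data_path="/"):
    """Sample free disk space and the 1-minute load average (Unix only)."""
    usage = shutil.disk_usage(data_path)
    load_1m, _load_5m, _load_15m = os.getloadavg()
    return {"disk_free_bytes": usage.free, "load_average_1m": load_1m}


# Example usage on a Linux host:
#   with open("/proc/meminfo") as handle:
#       print(parse_meminfo(handle.read()))
#   print(system_snapshot("/var/lib/riak"))  # hypothetical data directory
```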
Statistics and Monitoring Tools
There are many open source, self-hosted, and service-based solutions for aggregating and analyzing statistics and log data for the purposes of monitoring, alerting, and trend analysis on a Riak cluster. Some solutions provide Riak-specific modules or plugins as noted.
The following are solutions that customers and community members have reported using successfully to monitor the operational status of their Riak clusters. Community and open source projects are presented along with commercial and hosted services.
Community and Open Source Tools
Riaknostic is a growing suite of diagnostic checks that can be run against your Riak node to discover common problems and recommend how to resolve them. These checks are derived from the experience of the Basho Client Services Team as well as numerous public discussions on the mailing list, IRC room, and other online media.
Riaknostic integrates into the
riak-admin command via the diag
subcommand, and is a great first step in the process of diagnosing and
troubleshooting issues on Riak nodes.
Riak Control is Basho's REST-driven user-interface for managing Riak clusters. It is designed to give you quick insight into the health of your cluster and allow for easy management of nodes.
While Riak Control does not currently offer specific monitoring and statistics aggregation or analysis functionality, it does offer features which provide immediate insight into overall cluster health, node status, and handoff operations.
collectd gathers statistics about the system it is running on and stores them. The statistics are then typically graphed to find current performance bottlenecks, predict system load, and analyze trends.
Ganglia is a monitoring system specifically designed for large, high-performance groups of computers, such as clusters and grids. Customers and community members using Riak have reported success in using Ganglia to monitor Riak clusters.
Nagios is a monitoring and alerting solution that can provide information on the status of Riak cluster nodes, in addition to various types of alerting when particular events occur. Nagios also offers logging and reporting of events and can be used for identifying trends and capacity planning.
A collection of reusable Riak-specific scripts is available to the community for use with Nagios.
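To illustrate what such a check does, here is a minimal sketch of threshold logic that follows the conventional Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). It is not taken from the community scripts; the `read_repairs` stat name and the thresholds in the usage comment are assumptions chosen for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_stat(stats, name, warn, crit):
    """Compare one statistic with thresholds; return (exit_code, message)."""
    value = stats.get(name)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return UNKNOWN, f"UNKNOWN - {name} missing or not numeric"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {name}={value}"
    if value >= warn:
        return WARNING, f"WARNING - {name}={value}"
    return OK, f"OK - {name}={value}"


# Example usage inside a plugin script:
#   import sys
#   code, message = check_stat(stats, "read_repairs", warn=10, crit=50)
#   print(message)
#   sys.exit(code)
```

Returning UNKNOWN when a statistic is missing keeps a misconfigured check from silently reporting OK.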
Riemann uses a powerful stream processing language to aggregate events from client agents running on Riak nodes, and can help track trends or report on events as they occur. Statistics can be gathered from your nodes and forwarded to a solution such as Graphite for producing related graphs.
A Riemann Tools project consisting of small programs for sending data to Riemann provides a module specifically designed to read Riak statistics.
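For a sense of what forwarding statistics to a graphing backend involves, the sketch below formats numeric statistics as lines in Graphite's plaintext format (`path value timestamp`). It bypasses Riemann entirely and is not how the Riemann Tools module works; the metric prefix and function name are hypothetical.

```python
import time


def to_graphite_lines(stats, prefix, timestamp=None):
    """Format numeric stats as Graphite plaintext lines, sorted by name."""
    stamp = int(time.time()) if timestamp is None else int(timestamp)
    lines = []
    for name in sorted(stats):
        value = stats[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # lists, strings, and flags are not graphable values
        lines.append(f"{prefix}.{name} {value} {stamp}")
    return lines


# Example usage; the receiving host and port depend on your Graphite setup:
#   payload = "\n".join(to_graphite_lines(stats, "riak.node1")) + "\n"
```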
OpenTSDB is a distributed, scalable Time Series Database (TSDB) used to store, index, and serve metrics from various sources. It can collect data at a large scale and graph these metrics on the fly.
Commercial and Hosted Service Tools
The following are some commercial tools which Basho customers have reported successfully using for statistics gathering and monitoring within their Riak clusters.
Circonus provides organization-wide monitoring, trend analysis, alerting, notifications, and dashboards. It can be used to provide trend analysis and help with troubleshooting and capacity planning in a Riak cluster environment.
Splunk is available as downloadable software or
as a service, and provides tools for visualization of machine-generated
data such as log files. It can be connected to Riak's HTTP statistics
endpoint.
Splunk can be used to aggregate all Riak cluster node operational log files, including operating system and Riak-specific logs, as well as Riak statistics data. These data are then available for real-time graphing, search, and other visualization, which is ideal for troubleshooting complex issues and spotting trends.
Riak exposes a wide range of vital statistics that can be aggregated, monitored, analyzed, graphed, and reported on in a variety of ways using numerous open source and commercial solutions.
If you use a solution not listed here with Riak and would like to include it (or would otherwise like to update the information on this page), feel free to fork the docs, add it in the appropriate section, and send a pull request to the Riak Docs.