Statistics from Riak
Riak reports data about its current operating status, including statistics in the form of counters and histograms. These statistics are available through the HTTP API via the /stats endpoint, or through the riak-admin status command.
This page presents the most commonly monitored and gathered statistics, as well as numerous solutions for monitoring and gathering statistics that our customers and community report using successfully in Riak cluster environments. You can learn more about the specific Riak statistics provided in the Inspecting a Node documentation.
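As an illustration of reading the /stats endpoint described above, here is a minimal Python sketch. It assumes the endpoint returns a JSON object and that the node's HTTP interface listens on 127.0.0.1:8098; the address and the helper names `parse_stats` and `fetch_stats` are assumptions made for this example, not part of Riak.

```python
import json
from urllib.request import urlopen

# Assumption: the node's HTTP interface is reachable at this address.
# Adjust the host and port to match your own cluster configuration.
DEFAULT_STATS_URL = "http://127.0.0.1:8098/stats"


def parse_stats(body):
    """Decode the JSON document returned by /stats into a dict."""
    stats = json.loads(body)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object of statistics")
    return stats


def fetch_stats(url=DEFAULT_STATS_URL, timeout=5.0):
    """Fetch and decode statistics from a node's /stats endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return parse_stats(response.read().decode("utf-8"))
```

Calling `fetch_stats()` against a running node should return a dictionary keyed by statistic name, which a collector can then sample at a fixed interval.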
Riak provides counters for GETs, PUTs, read repairs, and other common operations. By default, each counter reports either the number of operations within the last minute or the total number since the node started.
Gets and Puts
GET/PUT counters are provided for both nodes and vnodes. These counters are commonly graphed over time for trend analysis, capacity planning, and so forth.
||Number of GETs coordinated by this node within the last minute, including GETs to non-local vnodes on this node|
||Number of GETs coordinated by this node since startup, including GETs to non-local vnodes|
||Number of PUTs coordinated by this node within the last minute, including PUTs to non-local vnodes on this node|
||Number of PUTs coordinated by this node since startup, including PUTs to non-local vnodes|
||Number of GET operations coordinated by vnodes on this node within the last minute|
||Number of GETs coordinated by local vnodes since node startup|
||Number of PUTs coordinated by local vnodes since node startup|
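When graphing the since-startup counters above, it is common to convert two successive samples into a rate. The following Python sketch shows one way to do that; the function name and the restart handling are assumptions for this example, not Riak behavior.

```python
def rate_per_second(previous_total, current_total, elapsed_seconds):
    """Operations per second between two samples of a since-startup counter.

    If the counter went backwards, the node was probably restarted, so the
    current total is treated as the number of operations since the restart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if current_total < previous_total:
        delta = current_total
    else:
        delta = current_total - previous_total
    return delta / elapsed_seconds
```

For example, two samples of a total GET counter taken 60 seconds apart with values 1000 and 1600 give a rate of 10 operations per second.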
Read repair counters are commonly graphed and monitored for abnormally high totals, which can be indicative of an issue.
||Number of read repair operations this node has coordinated in the last minute|
||Number of read repair operations this node has coordinated since the node was started|
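One simple way to flag abnormally high read repair counts is to compare the latest one-minute value with a recent baseline. The sketch below is only an illustration: the function name, the factor of 3, and the minimum baseline are assumptions to tune for your own cluster, not recommended values.

```python
from statistics import mean


def is_read_repair_spike(history, current, factor=3.0, min_baseline=1.0):
    """Return True when `current` exceeds `factor` times the recent average.

    `history` holds earlier one-minute read repair counts. `min_baseline`
    keeps a near-zero average from flagging every small increase.
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = max(mean(history), min_baseline)
    return current > factor * baseline
```

With a history of 2, 3, 2, and 3 repairs per minute, a new value of 20 would be flagged while a value of 5 would not.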
A counter for coordinator redirections is also provided. It reports the total number of redirections since node startup.
||Number of requests this node has redirected to other nodes for coordination since startup|
Riak provides statistics for a range of operations. By default, Riak provides the mean, median, 95th percentile, 99th percentile, and 100th percentile (the maximum) over a 60-second window.
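To make the five summary values concrete, the following Python sketch computes them from a list of samples using the nearest-rank percentile method. This is an assumption chosen for illustration; Riak's internal histogram calculation may differ, and the function names are hypothetical.

```python
from statistics import mean, median


def nearest_rank(ordered, percentile):
    """Nearest-rank percentile of an already sorted, non-empty list.

    `percentile` is an integer from 1 to 100. Integer arithmetic is used
    to compute ceil(percentile * n / 100) without floating point error.
    """
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Return mean, median, and 95th/99th/100th percentiles of `samples`."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples in window")
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "95": nearest_rank(ordered, 95),
        "99": nearest_rank(ordered, 99),
        "100": nearest_rank(ordered, 100),
    }
```

Under this method the 100th percentile is always the largest sample in the window, which is why it is useful for spotting worst-case latencies.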
Finite State Machine Time
Riak exposes Finite State Machine (FSM) time counters (node_get_fsm_time_* and node_put_fsm_time_*) that measure the time, in microseconds, required to traverse the GET or PUT FSM code, offering a picture of general node health.
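Because FSM times are reported in microseconds, dashboards often convert them to milliseconds. The sketch below pulls the FSM time entries out of a statistics dictionary and converts them. It assumes GET-side keys follow the same naming pattern as node_put_fsm_time_*, and the function name is hypothetical.

```python
def fsm_times_ms(stats, operation):
    """Return {suffix: milliseconds} for node_<operation>_fsm_time_* entries.

    `stats` is a dict of statistic names to values and `operation` is
    either "get" or "put". Values are assumed to be in microseconds.
    """
    if operation not in ("get", "put"):
        raise ValueError("operation must be 'get' or 'put'")
    prefix = f"node_{operation}_fsm_time_"
    return {
        key[len(prefix):]: value / 1000.0
        for key, value in stats.items()
        if key.startswith(prefix) and isinstance(value, (int, float))
    }
```

A sustained rise in the higher percentiles returned here is usually more informative than a change in the mean alone.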
GET FSM Object Size
GET FSM Object Size (node_get_fsm_objsize_*) measures the size of objects flowing through this node's GET finite state machine (GET_FSM). The size of an object is obtained by summing the length of the object's bucket name, key, serialized vector clock, value, and the serialized metadata of each sibling.
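The size calculation described above can be sketched in Python as follows. The `Sibling` and `StoredObject` types are hypothetical stand-ins invented for this example, and the sketch reads the description as counting the value and serialized metadata of every sibling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sibling:
    value: bytes
    metadata: bytes  # serialized metadata for this sibling


@dataclass
class StoredObject:
    bucket: bytes
    key: bytes
    vclock: bytes  # serialized vector clock
    siblings: List[Sibling]


def approximate_object_size(obj):
    """Sum the byte lengths that make up the object's reported size."""
    size = len(obj.bucket) + len(obj.key) + len(obj.vclock)
    for sibling in obj.siblings:
        size += len(sibling.value) + len(sibling.metadata)
    return size
```

Under this reading, an object with many siblings grows in reported size even when each individual value is small, which is one reason object size and sibling counts are worth graphing together.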
GET FSM Siblings
GET FSM Siblings (node_get_fsm_siblings_*) provides a histogram, over a 60-second window, of the number of siblings this node encounters on each GET request.
Riak Metrics To Graph
||Mean object size encountered by this node within the last minute|
||Median object size encountered by this node within the last minute|
||95th percentile object size encountered by this node within the last minute|
||100th percentile object size encountered by this node within the last minute|
||Mean time between reception of client GET request and subsequent response to client|
||Median time between reception of client GET request and subsequent response to client|
||95th percentile time between reception of client GET request and subsequent response to client|
||100th percentile time between reception of client GET request and subsequent response to client|
||Mean time between reception of client PUT request and subsequent response to client|
||Median time between reception of client PUT request and subsequent response to client|
||95th percentile time between reception of client PUT request and subsequent response to client|
||100th percentile time between reception of client PUT request and subsequent response to client|
||Mean number of siblings encountered during all GET operations by this node within the last minute|
||Median number of siblings encountered during all GET operations by this node within the last minute|
||95th percentile of siblings encountered during all GET operations by this node within the last minute|
||100th percentile of siblings encountered during all GET operations by this node within the last minute|
||Total amount of memory used by Erlang processes|
||Number of Read Repairs this node has coordinated within the last minute|
||Number of Read Repairs this node has coordinated since startup|
||Number of Erlang processes|
||Number of requests this node has redirected to other nodes for coordination since startup|
||Number of protocol buffer connections in the last minute|
||Number of active protocol buffer connections|
Systems Metrics To Graph
|Available Disk Space|
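Available disk space can be sampled with the Python standard library. The sketch below is a minimal example; the function name is hypothetical, and the path you pass should be the volume that holds your Riak data directory, which this example does not assume.

```python
import shutil


def available_disk_fraction(path="."):
    """Return the fraction (0.0 to 1.0) of the volume at `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.free / usage.total
```

Graphing this fraction over time makes it easier to project when a node will need more storage.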
Statistics and Monitoring Tools
There are many open source, self-hosted, and service-based solutions for aggregating and analyzing statistics and log data for the purposes of monitoring, alerting, and trend analysis on a Riak cluster. Some solutions provide Riak-specific modules or plugins as noted.
The solutions below are ones that customers and community members report using successfully to monitor the operational status of their Riak clusters. Community and open source projects are presented along with commercial and hosted services.
Community and Open Source Tools
Riaknostic is a growing suite of diagnostic checks that can be run against your Riak node to discover common problems and recommend how to resolve them. These checks are derived from the experience of the Basho Client Services Team as well as numerous public discussions on the mailing list, IRC room, and other online media.
Riaknostic integrates into the riak-admin command via a diag subcommand, and is a great first step in the process of diagnosing and troubleshooting issues on Riak nodes.
Riak Control is Basho's REST-driven user interface for managing Riak clusters. It is designed to give you quick insight into the health of your cluster and allow for easy management of nodes.
While Riak Control does not currently offer specific monitoring and statistics aggregation or analysis functionality, it does offer features which provide immediate insight into overall cluster health, node status, and handoff operations.
collectd gathers statistics about the system it is running on and stores them. The statistics are then typically graphed to find current performance bottlenecks, predict system load, and analyze trends.
Ganglia is a monitoring system specifically designed for large, high-performance groups of computers, such as clusters and grids. Customers and community members using Riak have reported success in using Ganglia to monitor Riak clusters.
Nagios is a monitoring and alerting solution that can provide information on the status of Riak cluster nodes, in addition to various types of alerting when particular events occur. Nagios also offers logging and reporting of events and can be used for identifying trends and capacity planning.
A collection of reusable Riak-specific scripts is available to the community for use with Nagios.
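To show the general shape of a Nagios-style check, here is a hypothetical Python sketch that compares one statistic against warning and critical thresholds. It is not one of the community scripts mentioned above. It relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the function name and message format are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_stat(stats, name, warn, crit):
    """Return (exit_code, message) for one statistic in a stats dict.

    The statistic is CRITICAL at or above `crit`, WARNING at or above
    `warn`, and UNKNOWN when it is missing or not numeric.
    """
    value = stats.get(name)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return UNKNOWN, f"UNKNOWN - {name} not found in statistics"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {name}={value}"
    if value >= warn:
        return WARNING, f"WARNING - {name}={value}"
    return OK, f"OK - {name}={value}"
```

A wrapper script would print the returned message and exit with the returned code so that Nagios can interpret the result.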
Riemann uses a powerful stream processing language to aggregate events from client agents running on Riak nodes, and can help track trends or report on events as they occur. Statistics can be gathered from your nodes and forwarded to a solution such as Graphite for producing related graphs.
A Riemann Tools project consisting of small programs for sending data to Riemann provides a module specifically designed to read Riak statistics.
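For readers who want to see what forwarding statistics to Graphite might look like without Riemann in between, the sketch below formats numeric statistics as lines in Graphite's plaintext format (metric path, value, and Unix timestamp separated by spaces). It does not use Riemann or the Riemann Tools module. The function name and the metric prefix are assumptions for this example.

```python
import time


def graphite_lines(stats, prefix, timestamp=None):
    """Format numeric statistics as Graphite plaintext lines.

    Non-numeric values (such as node names or version strings) are skipped.
    Lines are returned sorted by statistic name, without trailing newlines.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    lines = []
    for name in sorted(stats):
        value = stats[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}.{name} {value} {ts}")
    return lines
```

Each returned line would be sent, followed by a newline, to whatever plaintext listener your Graphite installation exposes.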
OpenTSDB is a distributed, scalable Time Series Database (TSDB) used to store, index, and serve metrics from various sources. It can collect data at a large scale and graph these metrics on the fly.
Commercial and Hosted Service Tools
The following are some commercial tools which Basho customers have reported successfully using for statistics gathering and monitoring within their Riak clusters.
Circonus provides organization-wide monitoring, trend analysis, alerting, notifications, and dashboards. It can be used to provide trend analysis and help with troubleshooting and capacity planning in a Riak cluster environment.
Splunk is available as downloadable software or as a service, and provides tools for visualization of machine-generated data such as log files. It can be connected to Riak's HTTP statistics /stats endpoint.
Splunk can be used to aggregate all Riak cluster node operational log files, including operating system and Riak-specific logs and Riak statistics data. These data are then available for real-time graphing, search, and other visualization ideal for troubleshooting complex issues and spotting trends.
Riak exposes many vital statistics that can be aggregated, monitored, analyzed, graphed, and reported on in a variety of ways using open source and commercial solutions.
If you use a solution not listed here with Riak and would like to include it (or would otherwise like to update the information on this page), feel free to fork the docs, add it in the appropriate section, and send a pull request to the Riak Docs project.