
Statistics and Monitoring

    Statistics from Riak

    Riak provides data related to current operating status, which includes statistics in the form of counters and histograms. These statistics are made available through the HTTP API via the /stats endpoint, or through the riak-admin status command.
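
    As a minimal sketch of reading these statistics over HTTP, the Python below assumes a node whose HTTP listener is reachable at 127.0.0.1:8098 and that /stats answers with a JSON object. The address and the helper names (parse_stats, fetch_stats) are illustrative assumptions, not part of Riak.

```python
import json
from urllib.request import Request, urlopen

# Assumed address of a node's HTTP listener; change it to match your configuration.
STATS_URL = "http://127.0.0.1:8098/stats"


def parse_stats(body):
    """Decode the JSON document returned by /stats into a dict."""
    stats = json.loads(body)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object of statistics")
    return stats


def fetch_stats(url=STATS_URL, timeout=5.0):
    """Request the stats document from a node and return it as a dict."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return parse_stats(response.read().decode("utf-8"))
```

    Calling fetch_stats() against a running node returns a dict keyed by statistic name, so a value such as node_gets can be read with stats.get("node_gets").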

    This page presents the most commonly monitored and gathered statistics, as well as numerous solutions for monitoring and gathering statistics that our customers and community report using successfully in Riak cluster environments. You can learn more about the specific Riak statistics provided in the Inspecting a Node documentation.

    Counters

    Riak provides counters for GETs, PUTs, read repairs, and other common operations. By default, each counter reports either the number of operations within the last minute or the number of operations since the node started.

    Gets and Puts

    GET/PUT counters are provided for both nodes and vnodes. These counters are commonly graphed over time for trend analysis, capacity planning, and so forth.

    Metric            Description
    node_gets         Number of GETs coordinated by this node within the last minute, including GETs to non-local vnodes
    node_gets_total   Number of GETs coordinated by this node since startup, including GETs to non-local vnodes
    node_puts         Number of PUTs coordinated by this node within the last minute, including PUTs to non-local vnodes
    node_puts_total   Number of PUTs coordinated by this node since startup, including PUTs to non-local vnodes
    vnode_gets        Number of GETs coordinated by local vnodes within the last minute
    vnode_gets_total  Number of GETs coordinated by local vnodes since node startup
    vnode_puts_total  Number of PUTs coordinated by local vnodes since node startup
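
    When graphing the *_total counters, it is common to turn two successive samples into a rate. The sketch below shows one way to do that. The function name and the choice to treat a decreasing total as a node restart are assumptions made for illustration.

```python
def counter_rate(prev_total, curr_total, elapsed_seconds):
    """Return operations per second between two samples of a *_total counter.

    Returns None when the counter went backwards. Because totals are counted
    since node startup, a smaller value is treated here as a restart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_total < prev_total:
        return None
    return (curr_total - prev_total) / elapsed_seconds
```

    For example, samples of node_gets_total taken 60 seconds apart with values 1000 and 1600 give a rate of 10 GETs per second.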

    Read Repairs

    Read repair counters are commonly graphed and monitored for abnormally high values, which can indicate a problem.

    Metric              Description
    read_repairs        Number of read repair operations this node has coordinated in the last minute
    read_repairs_total  Number of read repair operations this node has coordinated since the node was started
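
    One way to make "abnormally high" concrete is to compare the current read_repairs value with a baseline built from recent samples. The sketch below is an assumption about how such a check could look, not a Riak feature. The function name, the three-sigma rule, and the minimum sample count are all illustrative.

```python
from statistics import mean, pstdev


def is_abnormal(history, current, sigmas=3.0, min_samples=5):
    """Return True when current exceeds the historical mean by `sigmas`
    population standard deviations.

    With fewer than min_samples observations there is no usable baseline,
    so nothing is flagged. A perfectly flat history has zero spread, so any
    value above it is flagged.
    """
    if len(history) < min_samples:
        return False
    baseline = mean(history)
    spread = pstdev(history)
    return current > baseline + sigmas * spread
```

    The history argument would typically hold the last several one-minute read_repairs samples for the same node.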

    Coordinated Redirection

    A counter tracks the total number of coordinated node redirection operations since node startup.

    Metric              Description
    coord_redirs_total  Number of requests this node has redirected to other nodes for coordination since startup

    Statistics

    Riak provides statistics for a range of operations. By default, Riak reports the mean, median, 95th percentile, 99th percentile, and 100th percentile (the maximum) over a 60-second window.
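
    To illustrate what these summary values mean, the sketch below keeps samples in a sliding time window and computes the same five figures. It is not Riak's implementation: the class name is hypothetical, and the nearest-rank percentile method is an assumption, so results may differ slightly from what a node reports.

```python
import math
import time
from collections import deque
from statistics import mean, median


class WindowedStats:
    """Summary statistics over samples recorded in the last window_seconds."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value, now=None):
        now = time.monotonic() if now is None else now
        self._samples.append((now, value))

    def summary(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        values = sorted(v for _, v in self._samples)
        if not values:
            return None

        def percentile(p):
            # Nearest rank: smallest value with at least p percent of the
            # samples at or below it.
            rank = max(1, math.ceil(p * len(values) / 100))
            return values[rank - 1]

        return {
            "mean": mean(values),
            "median": median(values),
            "95": percentile(95),
            "99": percentile(99),
            "100": percentile(100),  # same as the maximum
        }
```

    With this method the 100th percentile is always the largest sample in the window, which is why it is useful for spotting outliers that the mean and median hide.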

    Finite State Machine Time

    Riak exposes Finite State Machine (FSM) time statistics (node_get_fsm_time_* and node_put_fsm_time_*) that measure the time, in microseconds, required to traverse the GET or PUT FSM code. These statistics offer a picture of general node health.
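
    Because the FSM times are reported in microseconds, dashboards often convert them to milliseconds first. The sketch below assumes the statistics are already available as a dict keyed by statistic name. The function name is hypothetical, and non-numeric values are skipped.

```python
FSM_TIME_PREFIXES = ("node_get_fsm_time_", "node_put_fsm_time_")


def fsm_times_ms(stats):
    """Return the FSM time statistics converted from microseconds to milliseconds."""
    return {
        name: value / 1000.0
        for name, value in stats.items()
        if name.startswith(FSM_TIME_PREFIXES) and isinstance(value, (int, float))
    }
```

    Only keys that start with one of the two FSM time prefixes are returned, so the result can be passed directly to a graphing or alerting step.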

    GET FSM Object Size

    GET FSM Object Size (node_get_fsm_objsize_*) measures the size of objects flowing through this node's GET finite state machine (GET_FSM). The size of an object is obtained by summing the length of the object's bucket name, key, serialized vector clock, value, and the serialized metadata of each sibling.
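
    The size calculation described above can be sketched as follows. The data classes are hypothetical stand-ins rather than Riak's internal object representation, and the sketch reads the description as counting the value and serialized metadata of every sibling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sibling:
    value: bytes
    metadata: bytes  # serialized metadata for this sibling


@dataclass
class StoredObject:
    bucket: bytes
    key: bytes
    vclock: bytes  # serialized vector clock
    siblings: List[Sibling]


def approximate_object_size(obj):
    """Sum the lengths of bucket, key, vector clock, and each sibling's
    value and metadata, in bytes."""
    size = len(obj.bucket) + len(obj.key) + len(obj.vclock)
    for sibling in obj.siblings:
        size += len(sibling.value) + len(sibling.metadata)
    return size
```

    Under this reading, an object with many siblings contributes a larger size than its individual values suggest, because every sibling is counted.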

    GET FSM Siblings

    GET FSM Siblings (node_get_fsm_siblings_*) provides a histogram (with a 60-second window) of the number of siblings this node encounters when serving GET requests.

    Riak Metrics To Graph

    Metric                        Description
    node_get_fsm_objsize_mean     Mean object size encountered by this node within the last minute
    node_get_fsm_objsize_median   Median object size encountered by this node within the last minute
    node_get_fsm_objsize_95       95th percentile object size encountered by this node within the last minute
    node_get_fsm_objsize_100      100th percentile object size encountered by this node within the last minute
    node_get_fsm_time_mean        Mean time between reception of client GET request and subsequent response to client
    node_get_fsm_time_median      Median time between reception of client GET request and subsequent response to client
    node_get_fsm_time_95          95th percentile time between reception of client GET request and subsequent response to client
    node_get_fsm_time_100         100th percentile time between reception of client GET request and subsequent response to client
    node_put_fsm_time_mean        Mean time between reception of client PUT request and subsequent response to client
    node_put_fsm_time_median      Median time between reception of client PUT request and subsequent response to client
    node_put_fsm_time_95          95th percentile time between reception of client PUT request and subsequent response to client
    node_put_fsm_time_100         100th percentile time between reception of client PUT request and subsequent response to client
    node_get_fsm_siblings_mean    Mean number of siblings encountered during all GET operations by this node within the last minute
    node_get_fsm_siblings_median  Median number of siblings encountered during all GET operations by this node within the last minute
    node_get_fsm_siblings_95      95th percentile of siblings encountered during all GET operations by this node within the last minute
    node_get_fsm_siblings_100     100th percentile of siblings encountered during all GET operations by this node within the last minute
    memory_processes_used         Total amount of memory used by Erlang processes
    read_repairs                  Number of Read Repairs this node has coordinated within the last minute
    read_repairs_total            Number of Read Repairs this node has coordinated since startup
    sys_process_count             Number of Erlang processes
    coord_redirs_total            Number of requests this node has redirected to other nodes for coordination since startup
    pbc_connect                   Number of protocol buffer connections in the last minute
    pbc_active                    Number of active protocol buffer connections
    pbc_active Number of active protocol buffer connections

    Systems Metrics To Graph

    Metric
    Available Disk Space
    IOWait
    Read Operations
    Write Operations
    Network Throughput
    Load Average
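
    Two of these system metrics, available disk space and load average, can be sampled with the Python standard library alone, as sketched below. The function name and the data_path argument (assumed to point at the filesystem holding the node's data) are illustrative. IOWait, read and write operations, and network throughput need platform-specific sources and are not covered here.

```python
import os
import shutil


def system_snapshot(data_path="/"):
    """Collect free disk space for data_path and the system load averages."""
    usage = shutil.disk_usage(data_path)
    free_percent = 100.0 * usage.free / usage.total if usage.total else 0.0
    try:
        load1, load5, load15 = os.getloadavg()
    except (AttributeError, OSError):
        # Load average is not available on this platform.
        load1 = load5 = load15 = None
    return {
        "disk_free_bytes": usage.free,
        "disk_free_percent": free_percent,
        "load_average_1m": load1,
        "load_average_5m": load5,
        "load_average_15m": load15,
    }
```

    Sampling this alongside the Riak statistics at the same interval makes it easier to line up node-level and system-level graphs.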

    Statistics and Monitoring Tools

    There are many open source, self-hosted, and service-based solutions for aggregating and analyzing statistics and log data for the purposes of monitoring, alerting, and trend analysis on a Riak cluster. Some solutions provide Riak-specific modules or plugins as noted.

    The following are solutions which customers and community members have reported success with when used for monitoring the operational status of their Riak clusters. Community and open source projects are presented along with commercial and hosted services.

    Community and Open Source Tools

    Riaknostic

    Riaknostic is a growing suite of diagnostic checks that can be run against your Riak node to discover common problems and recommend how to resolve them. These checks are derived from the experience of the Basho Client Services Team as well as numerous public discussions on the mailing list, IRC room, and other online media.

    Riaknostic integrates into the riak-admin command via a diag subcommand, and is a great first step in the process of diagnosing and troubleshooting issues on Riak nodes.

    Riak Control

    Riak Control is Basho's REST-driven user interface for managing Riak clusters. It is designed to give you quick insight into the health of your cluster and allow for easy management of nodes.

    While Riak Control does not currently offer specific monitoring and statistics aggregation or analysis functionality, it does offer features which provide immediate insight into overall cluster health, node status, and handoff operations.

    collectd

    collectd gathers statistics about the system it is running on and stores them. The statistics are then typically graphed to find current performance bottlenecks, predict system load, and analyze trends.

    Ganglia

    Ganglia is a monitoring system specifically designed for large, high-performance groups of computers, such as clusters and grids. Customers and community members using Riak have reported success in using Ganglia to monitor Riak clusters.

    A Riak Ganglia module for collecting statistics from the Riak HTTP /stats endpoint is also available.

    Nagios

    Nagios is a monitoring and alerting solution that can provide information on the status of Riak cluster nodes, in addition to various types of alerting when particular events occur. Nagios also offers logging and reporting of events and can be used for identifying trends and capacity planning.

    A collection of reusable Riak-specific scripts is available to the community for use with Nagios.
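
    Independently of those community scripts, the core of a threshold check can be sketched in a few lines. The example follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN), with performance data after the "|" separator. The function name and the thresholds are hypothetical, and the statistics are assumed to be available as a dict.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_stat(stats, name, warn, crit):
    """Compare one statistic with warning and critical thresholds.

    Returns (exit_code, message) in the shape a Nagios plugin would report.
    """
    value = stats.get(name)
    if not isinstance(value, (int, float)):
        return UNKNOWN, f"UNKNOWN - {name} missing or not numeric"
    if value >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Performance data: label=value;warn;crit
    return code, f"{label} - {name}={value} | {name}={value};{warn};{crit}"
```

    A wrapper script would print the message and exit with the returned code, for example by calling sys.exit(code).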

    Riemann

    Riemann uses a powerful stream processing language to aggregate events from client agents running on Riak nodes, and can help track trends or report on events as they occur. Statistics can be gathered from your nodes and forwarded to a solution such as Graphite for producing related graphs.

    A Riemann Tools project consisting of small programs for sending data to Riemann provides a module specifically designed to read Riak statistics.
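
    To give a sense of what statistics look like once they reach Graphite, the sketch below formats selected values as Graphite plaintext protocol lines ("path value timestamp"). It does not use Riemann and is not how Riemann forwards data. The function name and the metric prefix are assumptions for illustration.

```python
import time


def graphite_lines(stats, prefix, names, timestamp=None):
    """Format the named numeric statistics as Graphite plaintext lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    lines = []
    for name in names:
        value = stats.get(name)
        # Skip statistics that are missing or not numeric.
        if isinstance(value, (int, float)):
            lines.append(f"{prefix}.{name} {value} {ts}")
    return lines
```

    Each line can be written, followed by a newline, to a Graphite plaintext listener. The prefix is typically chosen to identify the cluster and node.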

    OpenTSDB

    OpenTSDB is a distributed, scalable Time Series Database (TSDB) used to store, index, and serve metrics from various sources. It can collect data at a large scale and graph these metrics on the fly.

    A Riak collector for OpenTSDB is available as part of the tcollector framework.

    Commercial and Hosted Service Tools

    The following are some commercial tools which Basho customers have reported successfully using for statistics gathering and monitoring within their Riak clusters.

    Circonus

    Circonus provides organization-wide monitoring, trend analysis, alerting, notifications, and dashboards. It can be used to provide trend analysis and help with troubleshooting and capacity planning in a Riak cluster environment.

    Splunk

    Splunk is available as downloadable software or as a service, and provides tools for visualization of machine-generated data such as log files. It can be connected to Riak's HTTP /stats endpoint.

    Splunk can be used to aggregate all Riak cluster node operational log files, including operating system and Riak-specific logs and Riak statistics data. These data are then available for real-time graphing, search, and other visualization ideal for troubleshooting complex issues and spotting trends.

    Summary

    Riak exposes a wide range of statistics that can be aggregated, monitored, analyzed, graphed, and reported on using numerous open source and commercial solutions.

    If you use a solution not listed here with Riak and would like to include it (or would otherwise like to update the information on this page), feel free to fork the docs, add it in the appropriate section, and send a pull request to the Riak Docs project.
