This is intended to be a brief, objective, and technical comparison of Riak and Cassandra. The Cassandra version described is 1.2.x. The Riak version described is 1.2.x. If you feel this comparison is unfaithful in any way, please fix it or send an email to firstname.lastname@example.org.
At A Very High Level
- Both Riak and Cassandra are Apache 2.0 licensed databases based on Amazon’s Dynamo paper.
- Riak is a faithful implementation of Dynamo, with the addition of functionality like links, MapReduce, indexes, and full-text search. Cassandra departs from the Dynamo paper slightly by omitting vector clocks and moving from partition-based consistent hashing to key ranges, while adding functionality like order-preserving partitioners and range queries.
- Riak is written primarily in Erlang with some bits in C. Cassandra is written in Java.
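To make the vector clock distinction above concrete, here is a minimal sketch of how a vector clock can tell "newer" apart from "concurrent", something a single timestamp cannot express. It is an illustration only and not Riak's actual implementation; the function names (`increment`, `descends`, `compare`) and the actor IDs are hypothetical.

```python
from typing import Dict

# A vector clock maps an actor (client or node ID) to a counter.
VClock = Dict[str, int]


def increment(clock: VClock, actor: str) -> VClock:
    """Return a copy of the clock with the actor's counter bumped by one."""
    new_clock = dict(clock)
    new_clock[actor] = new_clock.get(actor, 0) + 1
    return new_clock


def descends(a: VClock, b: VClock) -> bool:
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify two clocks as equal, ordered, or concurrent (a conflict)."""
    a_covers_b = descends(a, b)
    b_covers_a = descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_newer"
    if b_covers_a:
        return "b_newer"
    return "concurrent"


# Two clients update the same starting value independently.
base = increment({}, "client_x")
from_x = increment(base, "client_x")
from_y = increment(base, "client_y")

print(compare(from_x, base))    # from_x is a direct descendant of base
print(compare(from_x, from_y))  # neither descends from the other
```

With timestamp-based resolution, the two independent updates would simply be ordered by clock time and one would silently win; the vector clock instead reports them as concurrent so that something else (application or client code, in Riak's case) can resolve the conflict.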
The table below gives a high level comparison of Riak and Cassandra features/capabilities. To keep this page relevant in the face of rapid development on both sides, low level details are found in links to Riak and Cassandra online documentation.
|Data Model||Riak stores key/value pairs in a higher level namespace called a bucket.||Cassandra’s data model resembles column storage, and consists of Keyspaces, Column Families, and several other parameters.|
|Storage Model||Riak has a modular, extensible local storage system which lets you plug in a backend store of your choice to suit your use case. The default backend is Bitcask.||Cassandra’s write path starts with a write to a commit log followed by a subsequent write to an in-memory structure called a memtable. Writes are then batched to a persistent table structure called a sorted string table (SST).|
|Data Access and APIs||Riak offers two primary interfaces (in addition to raw Erlang access): HTTP and Protocol Buffers.||Cassandra provides various access methods including a Thrift API, CQL (Cassandra Query Language), and CLI.|
|Query Types and Query-ability||There are currently four ways to query data in Riak.||Cassandra offers various ways to query data.|
|Data Versioning and Consistency||Riak uses a data structure called a vector clock to reason about causality and staleness of stored values. Vector clocks enable clients to always write to the database in exchange for consistency conflicts being resolved at read time by either application or client code. Vector clocks can be configured to store copies of a given datum based on size and age of said datum. There is also an option to disable vector clocks and fall back to simple timestamp-based “last-write-wins”.||Cassandra uses timestamps on each column to determine the most recent value when doing read requests. There is no built-in way to do versioning of data.|
|Concurrency||In Riak, any node in the cluster can coordinate a read/write operation for any other node. Riak stresses availability for writes and reads, and puts the burden of resolution on the client at read time.||All nodes in Cassandra are peers. A client read or write request can go to any node in the cluster. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.|
|Replication||Riak’s replication system is heavily influenced by the Dynamo Paper and Dr. Eric Brewer’s CAP Theorem. Riak uses consistent hashing to replicate and distribute N copies of each value around a Riak cluster composed of any number of physical machines. Under the hood, Riak uses virtual nodes to handle the distribution and dynamic rebalancing of data, thus decoupling the data distribution from physical assets.||Replication in Cassandra starts when a user chooses a partitioner. Partitioners include Random Partitioner (which also relies on consistent hashing for data storage) and various Ordered Partitioner options. Under the hood, physical nodes are assigned tokens which determine a node’s position on the ring and the range of data for which it’s responsible.|
|Scaling Out and In||Riak allows you to elastically grow and shrink your cluster while evenly balancing the load on each machine. No node in Riak is special or has any particular role. In other words, the cluster is masterless. When you add a physical machine to Riak, the cluster is made aware of its membership via gossiping of ring state. Once it’s a member of the ring, it’s assigned an equal percentage of the partitions and subsequently takes ownership of the data belonging to those partitions. The process for removing a machine is the inverse of this. Riak also ships with a comprehensive suite of command line tools to help make node operations simple and straightforward.||Cassandra allows you to add new nodes dynamically, although each new node’s token must be calculated manually (users can elect to let Cassandra calculate this instead). It’s recommended that you double the size of your cluster to add capacity. If this isn’t feasible, you can elect to either add a number of nodes (which requires token recalculation for all existing nodes), or to add one node at a time, which means leaving the initial token blank and “will probably not result in a perfectly balanced ring but it will alleviate hot spots”.|
|Multi-Datacenter Replication||Riak features two distinct types of replication. Users can replicate to any number of nodes in one cluster (which is usually contained within one datacenter over a LAN) using the Apache 2.0 licensed database. Riak Enterprise, Basho’s commercial extension to Riak, is required for Multi-Datacenter deployments (meaning the ability to run active Riak clusters in N datacenters).|
|Graphical Monitoring/Admin Console||Riak ships with Riak Control, an open source graphical console for monitoring and managing Riak clusters.||DataStax distributes DataStax OpsCenter, a graphical user interface for monitoring and administering Cassandra clusters. This includes a free version available for production use, as well as a for-pay version with additional features.|
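The Replication and Scaling rows describe two ways of placing data on a ring: fixed partitions (virtual nodes) claimed by physical machines, versus one token per physical node that marks the end of a key range. The sketch below contrasts the two under simplifying assumptions. It is not code from either project: the SHA-1 hash, the round-robin claim, and all function names are illustrative choices, and neither system's real claim or token algorithm is reproduced here.

```python
import hashlib
from bisect import bisect_left
from typing import List, Tuple

RING_TOP = 2 ** 160  # size of the hash space (assumes a 160-bit SHA-1 hash)


def key_hash(key: str) -> int:
    """Map a key to an integer position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


# --- Partition (virtual node) style placement ---

def claim_partitions(nodes: List[str], ring_size: int) -> List[str]:
    """Divide the ring into ring_size equal partitions and hand them out
    round-robin, so each machine owns roughly the same share."""
    return [nodes[i % len(nodes)] for i in range(ring_size)]


def partition_for_hash(h: int, ring_size: int) -> int:
    """Index of the equal-sized partition that contains hash h."""
    return h * ring_size // RING_TOP


def preference_list(key: str, owners: List[str], n_val: int) -> List[str]:
    """Owners of the n_val consecutive partitions starting at the key's own."""
    first = partition_for_hash(key_hash(key), len(owners))
    return [owners[(first + i) % len(owners)] for i in range(n_val)]


# --- Token style placement ---

def token_owner_for_hash(h: int, tokens: List[Tuple[int, str]]) -> str:
    """tokens is sorted by token value; a node owns the range that ends at
    its token. Hashes past the last token wrap around to the first node."""
    positions = [token for token, _ in tokens]
    return tokens[bisect_left(positions, h) % len(tokens)][1]


# Partition style: 4 machines share 8 partitions, 3 replicas per key.
owners = claim_partitions(["n1", "n2", "n3", "n4"], ring_size=8)
print(preference_list("some-key", owners, n_val=3))

# Token style: 3 machines, tokens spaced evenly by hand.
tokens = [(RING_TOP // 4, "a"), (RING_TOP // 2, "b"), (3 * RING_TOP // 4, "c")]
print(token_owner_for_hash(key_hash("some-key"), tokens))
```

In the partition style, adding a machine means reassigning whole partitions while the partition boundaries stay fixed. In the token style, the boundaries themselves are the tokens, which is why adding nodes without recalculating tokens can leave the ring unbalanced.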