LevelDB

Overview

eLevelDB is an Erlang application that encapsulates LevelDB, an open source on-disk key-value store written by Google Fellows Jeffrey Dean and Sanjay Ghemawat. LevelDB is a relatively new entrant into the growing list of key/value database libraries but it has some very interesting qualities that we believe make it an ideal candidate for use in Riak. LevelDB's storage architecture is more like BigTable's memtable/sstable model than it is like Bitcask. This design and implementation brings the possibility of a storage engine without Bitcask's RAM limitation.

Riak 1.2 introduced changes in eLevelDB that allow users to tune LevelDB performance for “large data” environments typical in Riak deployments.

Strengths:

Weaknesses:

Installing eLevelDB

Riak ships with eLevelDB included within the distribution, so there is no separate installation required. However, Riak is configured by default to use the Bitcask storage engine. To switch to eLevelDB, set the storage_backend variable in app.config to riak_kv_eleveldb_backend.

{riak_kv, [
    ...,
    {storage_backend, riak_kv_eleveldb_backend},
    ...
]}

Configuring eLevelDB

eLevelDB's default behavior can be modified by adding or changing parameters in the eleveldb section of app.config. The sections below detail the parameters you'll use to modify eLevelDB, and the Parameter Planning section gives a step-by-step example illustrating how to choose parameter values based on your application requirements.

The configuration values that can be set in your app.config for eLevelDB are as follows:

 %% LevelDB Config
 {eleveldb, [
     %% Required. Set to your data storage directory
     {data_root, "/var/lib/riak/leveldb"},

     %% Memory usage per vnode

     %% Maximum number of files open at once per partition
     %% Default. You must calculate to adjust (see below)
     {max_open_files, 30},
     %% Default. You must calculate to adjust (see below)
     {cache_size, 8388608},

     %% Write performance, Write safety

     %% this is default, recommended
     {sync, false},
     %% this is default, recommended
     {write_buffer_size_min, 31457280},
     %% this is default, recommended
     {write_buffer_size_max, 62914560},

     %% Read performance

     %% Required, strongly recommended to be true
     {use_bloomfilter, true},
     %% Default. Recommended to be 4k
     {sst_block_size, 4096},
     %% Default. Recommended to be 16
     {block_restart_interval, 16},

     %% Database integrity

     %% Default. Strongly recommended to be true
     {verify_checksums, true},
     %% Default. Strongly recommended to be true
     {verify_compactions, true}
 ]},

Memory Usage per Vnode

The following values help adjust the memory usage per vnode.

Max Open Files

The max_open_files value is multiplied by 4 megabytes to create a file cache. The file cache may end up holding more or fewer files at any given moment due to variations in file metadata size. max_open_files applies to a single vnode, not to the entire server. See calculation details in Parameter Planning.

Where server resources allow, the value of max_open_files should exceed the count of .sst table files within the vnode's database directory. Memory budgeting for this variable is more important to random read operations than the cache_size discussed next.

The minimum max_open_files is 30. The default is also 30.

{eleveldb, [
    ...,
    {max_open_files, 30},
    ...
]}

Check your system’s open files limits

Because of the large number of open files used by this storage engine, it is imperative that you review and properly set your system's open files limits. If you see an error that contains emfile, it is highly likely that you have exceeded your system's open files limit; see the Tips & Tricks section later in this document for how to fix this issue.

Cache Size

The cache_size determines the size of each vnode's block cache. The block cache holds data blocks that LevelDB has recently retrieved from .sst table files. Any given block contains one or more complete key/value pairs. The cache speeds up repeat access to the same key and potential access to adjacent keys.

Riak 1.2 bug

A bug exists in Riak 1.2 where read performance is dramatically reduced if this value is greater than the default 8,388,608. The bug was fixed in Riak 1.3, and the cache performance tested well at values as high as 2 gigabytes.

The cache_size determines how much data LevelDB caches in memory. The more of your data set that fits in memory, the better LevelDB will perform. The LevelDB cache works in conjunction with your operating system and file system caches; do not disable or under-size them. If you are running a 64-bit Erlang VM, cache_size can safely be set above 2G, assuming you have enough memory available. Unlike Bitcask, LevelDB keeps keys and values in a block cache, which allows it to manage key spaces that are larger than the available memory. eLevelDB creates a separate LevelDB instance for each partition of the cluster, so each partition has its own cache.

We recommend that you set this to 20-30% of available RAM, where "available" means physical memory minus the RAM consumed by other services, including the filesystem cache overhead.

For example, take a cluster with 64 partitions running on 4 physical nodes with 16GB of RAM free on each. In the best case all the nodes are running, so a good cache size would be half the available RAM (8GB) divided by the number of expected active vnodes on each node, which is 64/4 = 16. That works out to 536,870,912 bytes (512MB) per vnode.

Best Case:

 (Number of free GBs / 2) * (1024 ^ 3)
---------------------------------------- = Cache Size
(Number of partitions / Number of nodes)

But in reality, a node may fail. What happens to this cluster when a physical node fails? The 16 vnodes that were managed by that node are now handled by the remaining active nodes in the cluster (3 in this case). Now, rather than 16 vnodes, each node will be handling approximately 22 vnodes (its original 16 plus some number of fallback vnodes managed on behalf of the failed node). With the cache size set for the optimal case, the total cache now consumes 22 * 512MB = 11GB instead of the expected 16 * 512MB = 8GB. The total available memory is now too heavily weighted towards cache, so you'll want to add in some wiggle room for this case.

Real Life:

    (Number of free GBs / 2) * (1024 ^ 3)
---------------------------------------------- = Cache Size
(Number of partitions / (Number of nodes - F))

F = Number of nodes that can fail before impacting cache use of memory
    on remaining systems

If we wanted one physical node to be able to fail before impacting cache memory utilization, we'd use F = 1. Now we're dividing half the available memory (still 8GB) by 64/(4-1) = 21.333 (round up to 22). This works out to 390,451,572 bytes (about 372MB) per vnode. Each physical node can now cache up to 22 vnodes before hitting the 50% memory usage mark; if a second node went down, this cluster would feel it.
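If you would rather script this arithmetic than do it by hand, the sketch below encodes the formulas above. The module and function names are hypothetical helpers, not part of Riak.

%% Hypothetical helper (not part of Riak) that encodes the cache sizing
%% formulas above, using integer arithmetic throughout.
-module(cache_size_calc).
-export([cache_size/4]).

%% FreeGB     - free RAM per physical node, in whole gigabytes
%% Partitions - ring size (number of partitions in the cluster)
%% Nodes      - number of physical nodes
%% F          - number of nodes allowed to fail before cache memory
%%              use becomes a problem (0 = best case)
cache_size(FreeGB, Partitions, Nodes, F) ->
    CacheBudget   = (FreeGB div 2) * 1024 * 1024 * 1024,  %% half of free RAM, in bytes
    Active        = Nodes - F,
    VnodesPerNode = (Partitions + Active - 1) div Active, %% vnodes per node, rounded up
    CacheBudget div VnodesPerNode.

%% Examples from the text:
%% cache_size_calc:cache_size(16, 64, 4, 0) -> 536870912  %% 512 MB, best case
%% cache_size_calc:cache_size(16, 64, 4, 1) -> 390451572  %% ~372 MB with F = 1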

You might ask yourself, 'why only 50% of available RAM?' and the answer is that the other physical RAM will be used by the operating system, the Erlang VM, and by the operating system's filesystem cache (the buffer pool on Linux). That filesystem cache will translate into much faster access times (read and write) for your cluster. A second reason for the buffer is to avoid paging memory to disk. Virtual memory helps prevent failure due to out of memory conditions, but the costs of paging memory to/from disk are very high and cause noticeable impact on cluster performance.

Default: 8MB

{eleveldb, [
    ...,
    {cache_size, 8388608}, %% 8MB default cache size per-partition
    ...
]}

Write Performance/Write safety

These values can be adjusted to trade off write performance against write safety. The general rule is: the faster you want your writes to perform (e.g., with a larger buffer), the less safe they are (more data could be lost in a crash).

Sync

This parameter defines how new key/value data is placed in the recovery log. The recovery log is only used if the Riak program crashes or the server loses power unexpectedly. The parameter's original intent was to guarantee that each new key/value pair was written to the physical disk before LevelDB acknowledged a successful write. The reality in modern servers is that many layers of data caching exist between the database program and the physical disks, and this flag influences only one of those layers.

Setting this flag to true guarantees that the write operation will be slower, but does not actually guarantee the data was written to the physical device. The data has likely been flushed from the operating system cache to the disk controller, and the disk controller may or may not save the data if the server power fails. Setting this flag to false allows faster write operations and depends upon the operating system's memory-mapped I/O conventions to provide reasonable recovery should the Riak program fail, but not if server power fails.

Default: false

{eleveldb, [
    ...,
    {sync, false},  %% do not write()/fsync() every time
    ...
]}

Write Buffer Size

Each vnode first stores new key/value data in a memory-based write buffer. This write buffer operates in parallel with the recovery log mentioned under the sync parameter. Riak creates each vnode with a randomly sized write buffer for performance reasons; the random size is somewhere between write_buffer_size_min and write_buffer_size_max.

If unspecified in app.config, eLevelDB will default to a write_buffer_size_min of 31,457,280 Bytes (30 MB) and write_buffer_size_max of 62,914,560 Bytes (60 MB). In this case, the average write buffer will be 47,185,920 bytes (45 MB).

Other portions of Basho's LevelDB tuning assume these two values are left at their defaults. Raising or lowering them will likely have a negative impact on overall write throughput.

{eleveldb, [
    ...,
    {write_buffer_size_min, 31457280},  %% 30 MB in bytes
    {write_buffer_size_max, 62914560},  %% 60 MB in bytes
    ...
]}

If you choose to change the write buffer size by setting write_buffer_size_min and write_buffer_size_max, write_buffer_size_min must be at least 30 MB, and write_buffer_size_min should be about half the size of write_buffer_size_max.

If you wish to set all write buffers to the same size, use the write_buffer_size parameter. This will override the write_buffer_size_min and write_buffer_size_max parameters. This is not recommended.
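For completeness, a fixed-size configuration might look like the following sketch. Again, overriding the min/max pair this way is not recommended; 47185920 bytes is simply the default average noted above.

{eleveldb, [
    ...,
    %% Not recommended: forces every vnode's write buffer to the same size,
    %% overriding write_buffer_size_min and write_buffer_size_max
    {write_buffer_size, 47185920},  %% 45 MB in bytes
    ...
]}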

Larger write buffers increase performance, especially during bulk loads. Up to two write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage.

Read Performance

These parameters can be set/adjusted to increase read performance.

The sst_block_size and block_restart_interval parameters determine how LevelDB will organize the key space within each .sst table file. The defaults are very good for all researched cases and are therefore recommended.

Bloom filter

Each database .sst table file includes an optional Bloom filter that is highly effective at short-circuiting queries for keys that are not present. The Bloom filter typically increases the size of an .sst table file by about 2%. To disable this feature, you must explicitly set this option to false in app.config.

Default: true

{eleveldb, [
    ...,
    %% Disabling the Bloom filter in LevelDB will marginally decrease
    %% file size at the expense of increased lookup time for misses.
    %% Running with the Bloom filter disabled is generally not 
    %% recommended and must be tested aggressively with your workload 
    %% to determine the impacts to your application's performance. 
    {use_bloomfilter, false},   
    ...
]}

Block Size

sst_block_size defines the size threshold for a block / chunk of data within one .sst table file. Each new block gets an index entry in the .sst table file's master index.

Default: 4096 (4K)

{eleveldb, [
    ...,
    {sst_block_size, 4096},  %% 4K blocks
    ...
]}

Block Restart Interval

block_restart_interval defines the key count threshold for a new key entry in the key index for a block.

Most clients should leave this parameter alone.

Default: 16

{eleveldb, [
      ...,
      {block_restart_interval, 16}, %% # of keys before restarting delta encoding
      ...
]}

Database Integrity

The verify_checksums and verify_compactions options control whether or not each data block from the disk is first validated via a CRC32c checksum. The tradeoff is this: disk read operations are slightly faster without the CRC32c validation, but it is quite likely Riak will crash if a corrupted block is passed from the disk to LevelDB unvalidated.

Verify Checksums

verify_checksums controls whether or not validation occurs when Riak requests data from the LevelDB database on behalf of the user.

Default: true

{eleveldb, [
    ...,
    {verify_checksums, true}, %% make sure data is what we expected it to be
    ...
]}

Verify Compactions

verify_compactions controls whether or not validation occurs when LevelDB reads data as part of its background compaction operations.

Default: true

{eleveldb, [
    ...,
    {verify_compactions, true},
    ...
]}

Parameter Planning

The following steps walk you through setting parameters and evaluating how much working memory (i.e. RAM) you'll need for a given LevelDB implementation.

Step 1: Calculate Available Working Memory

Current Unix-like systems (Linux / Solaris / SmartOS) use physical memory that is not allocated by programs as buffer space for disk operations. In Riak 1.2, LevelDB is modeled to depend upon this Operating System (OS) buffering. You must leave 25-50% of the physical memory available for the operating system (25-35% if servers have Solid State Drive (SSD) arrays, 35-50% if servers have spinning hard drives).

LevelDB working memory is calculated simply as the memory not reserved for the OS.

leveldb_working_memory = server_physical_memory * (1 - percent_reserved_for_os)

Example:

If a server has 32G RAM and we wish to reserve 50%,

leveldb_working_memory = 32G * (1 - .50) = 16G

Step 2: Calculate Working Memory per vnode

Riak 1.2 configures/assigns memory allocations by vnode. To calculate the vnode working memory, divide LevelDB's total working memory by the number of vnodes.

vnode_working_memory = leveldb_working_memory / vnode_count

Example:

If a physical server contains 64 vnodes,

vnode_working_memory = 16G / 64 = 268,435,456 Bytes per vnode

Step 3: Estimate Memory Used by Open Files

There are many variables that determine the exact memory any given file will require when open. The formula below gives an approximation that should be accurate within 10% for moderately large LevelDB implementations.

open_file_memory = (max_open_files - 10) * (184 + (average_sst_filesize / 2048) * (8 + ((average_key_size + average_value_size) / 2048 + 1) * 0.6))

Example:

If a physical server contains 64 vnodes and the parameter values in the table below,

open_file_memory = (150 - 10) * (184 + (314,572,800 / 2048) * (8 + ((28 + 1024) / 2048 + 1) * 0.6)) = 191,587,760 Bytes

Parameter             Value
max_open_files        150
average_sst_filesize  314,572,800 Bytes
average_key_size      28 Bytes
average_value_size    1,024 Bytes
Total                 191,587,760 Bytes
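If you prefer to compute this estimate in code, the sketch below (a hypothetical helper, not shipped with Riak) implements the approximation above.

%% Hypothetical helper implementing the open-file memory approximation above.
-module(open_file_calc).
-export([open_file_memory/4]).

%% MaxOpenFiles   - the max_open_files setting for the vnode
%% AvgSstFileSize - average .sst table file size, in bytes
%% AvgKeySize     - average key size, in bytes
%% AvgValueSize   - average value size, in bytes
open_file_memory(MaxOpenFiles, AvgSstFileSize, AvgKeySize, AvgValueSize) ->
    trunc((MaxOpenFiles - 10) *
          (184 + (AvgSstFileSize / 2048) *
                 (8 + ((AvgKeySize + AvgValueSize) / 2048 + 1) * 0.6))).

%% Example from the table above:
%% open_file_calc:open_file_memory(150, 314572800, 28, 1024) -> 191587760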


Step 4: Calculate Average Write Buffer

Calculate the average of write_buffer_size_min and write_buffer_size_max (see write buffer size for more on these parameters). The defaults are 31,457,280 Bytes (30 MB) and 62,914,560 Bytes (60 MB), respectively. Therefore the default average is 47,185,920 Bytes (45 MB).

Step 5: Calculate vnode Memory Used

The estimated amount of memory used by a vnode is the sum of the average write buffer size (Step 4), the cache_size, the memory used by open files (Step 3), and approximately 20 MB for LevelDB's management files.

Example:

Parameter                  Bytes
average write buffer size  47,185,920
cache size                 8,388,608
open files                 191,587,760
management files           20,971,520
Total                      268,133,808 (~255 MB)

Step 6: Compare Step 2 and Step 5 and Adjust Variables

Example: In Step 2 we calculated a working memory budget per vnode of 268,435,456 Bytes. In Step 5 we estimated that each vnode would consume approximately 268,133,808 Bytes. Step 2 and Step 5 are within 301,648 Bytes (~300 kB) of each other. This is exceptionally close, and more precise than really needed; the values are good enough when they are within 5% of each other.
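As a quick sanity check, the whole walkthrough can be mirrored in a few lines of Erlang shell input. This is an informal sketch using the example values from Steps 1-5; the 20,971,520 bytes for management files is the fixed allowance from the Step 5 table.

%% Informal Erlang shell sketch of Steps 1-6, using the example values above.
ServerRAM      = 32 * 1024 * 1024 * 1024,    %% Step 1: 32 GB of physical memory
WorkingMemory  = ServerRAM * 0.5,            %% reserve 50% for the OS
VnodeBudget    = WorkingMemory / 64,         %% Step 2: 64 vnodes per server
AvgWriteBuffer = (31457280 + 62914560) / 2,  %% Step 4: 45 MB average write buffer
VnodeEstimate  = AvgWriteBuffer              %% Step 5: sum of the components
                 + 8388608                   %%   cache_size (default 8 MB)
                 + 191587760                 %%   open files (from Step 3)
                 + 20971520,                 %%   management files allowance
%% Step 6: the estimate should be within roughly 5% of the budget.
{VnodeBudget, VnodeEstimate}.                %% {268435456.0, 268133808.0}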

The above calculations are automated in this memory model spreadsheet.

Tuning LevelDB

While eLevelDB can be extremely fast for a durable store, its performance varies based on how you tune it. All of the configuration is exposed via application variables in the eleveldb application scope.

Tips & Tricks:

Review and, if necessary, raise your operating system's open files limit (for example via ulimit -n and /etc/security/limits.conf). The large number of files LevelDB keeps open per vnode can otherwise trigger the emfile errors mentioned earlier.

Avoid extra disk head seeks by mounting your data volumes with the noatime flag, which turns off last-access-time updates on every read. For example, in /etc/fstab:

/dev/sda5    /data           ext3    noatime  1 1
/dev/sdb1    /data/inno-log  ext3    noatime  1 2

Recommended Settings

Below are general configuration recommendations for Linux distributions. Individual users may need to tailor these settings for their application.

sysctl

For production environments, please see System Performance Tuning for the recommended /etc/sysctl.conf settings.

Block Device Scheduler

Beginning with the 2.6 kernel, Linux gives you a choice of four I/O elevator models. We recommend using the NOOP elevator. You can do this by changing the scheduler on the Linux boot line: elevator=noop.

ext4 Options

The ext4 filesystem defaults include two options that increase integrity but slow performance. Because Riak's integrity is based on multiple nodes holding the same data, these two options can be changed to boost LevelDB's performance. We recommend setting: barrier=0 and data=writeback.
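Assuming a dedicated data volume mounted at /data (as in the /etc/fstab example earlier), an ext4 entry using these options might look like:

/dev/sda5    /data           ext4    noatime,barrier=0,data=writeback  1 1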

CPU Throttling

If CPU throttling is enabled, disabling it can boost LevelDB performance in some cases.

No Entropy

The 2.6 kernel is widely known for stalling programs while they wait for SSL entropy bits. If you are using HTTPS, we recommend installing the HAVEGE package for pseudorandom number generation.

clocksource

We recommend setting “clocksource=hpet” on your Linux kernel's boot line. The TSC clocksource has been identified as causing issues on machines with multiple physical processors and/or CPU throttling.

swappiness

We recommend setting vm.swappiness=0 in /etc/sysctl.conf. The vm.swappiness default is 60, which is aimed toward laptop users with application windows. This was a key change for MySQL servers and is often referenced in database performance literature.

FAQ

Implementation Details

LevelDB is a Google sponsored open source project that has been incorporated into an Erlang application and integrated into Riak for storage of key/value information on disk. The implementation of LevelDB is similar in spirit to the representation of a single Bigtable tablet (section 5.3).

How “Levels” Are Managed

LevelDB is a memtable/sstable design. The set of sorted tables is organized into a sequence of levels. Each level stores approximately ten times as much data as the level before it. The sorted table generated from a flush is placed in a special young level (also called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged together with all of the overlapping level-1 files to produce a sequence of new level-1 files (a new level-1 file is created for every 2MB of data).

Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e. 10MB for level-1, 100MB for level-2, …), one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive disk seeks).
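The sizing rule can be stated as a tiny function, shown here as an informal sketch (not an actual eLevelDB API):

%% Informal sketch of the level sizing rule described above: level L
%% (for L >= 1) is limited to roughly 10^L megabytes before a compaction
%% into the next level is triggered.
-module(level_sizes).
-export([max_bytes_for_level/1]).

max_bytes_for_level(L) when L >= 1 ->
    trunc(math:pow(10, L)) * 1024 * 1024.

%% level_sizes:max_bytes_for_level(1) -> 10485760    %% 10 MB
%% level_sizes:max_bytes_for_level(2) -> 104857600   %% 100 MB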

When the size of level L exceeds its limit, LevelDB will compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Compactions from level-0 to level-1 are treated specially because level-0 is special (files in it may overlap each other). A level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.

A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. LevelDB will switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). LevelDB will also switch to a new output file when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).

Compactions for a particular level rotate through the key space. In more detail, for each level L, LevelDB remembers the ending key of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no such file).

Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1 files (10MB) (i.e., LevelDB will read 14MB and write 14MB in that case).

Other than the special level-0 compactions, LevelDB will pick one 2MB file from level L. In the worst case, this will overlap with approximately 12 files from level L+1 (10 because level-(L+1) is ten times the size of level-L, and another two at the boundaries since the file ranges at level-L will usually not be aligned with the file ranges at level-L+1). The compaction will therefore read 26MB, write 26MB. Assuming a disk IO rate of 100MB/s, the worst compaction cost will be approximately 0.5 second.

If we throttle the background writing to a reasonably slow rate, for instance 10% of the full 100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at 10MB/s, LevelDB might build up lots of level-0 files (~50 to hold the 5*10MB). This may significantly increase the cost of reads due to the overhead of merging more files together on every read.

Compaction

Levels are compacted into ordered data files over time. Compaction first computes a score for each level as the ratio of bytes in that level to desired bytes. For level 0, it computes files / desired files instead. The level with the highest score is compacted.

When compacting L0, the only special case to consider is that after picking the primary L0 file to compact, LevelDB checks the other L0 files to determine the degree to which they overlap. While this is an attempt to avoid some I/O, in practice we can expect L0 compactions to usually, if not always, involve all L0 files.

See the PickCompaction routine in the LevelDB source for all the details.

Comparison of eLevelDB and Bitcask

LevelDB is a persistent ordered map; Bitcask is a persistent hash table (no ordered iteration). Bitcask stores keys in memory, so for databases with a large number of keys it may exhaust available physical memory and begin swapping into virtual memory, causing a severe slowdown in performance. Bitcask guarantees at most one disk seek per look-up, while LevelDB may have to do a small number of disk seeks. For instance, a read needs one disk seek per level. If 10% of the database fits in memory, LevelDB will need to do one seek (for the last level, since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, LevelDB will need two seeks.

Recovery

LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to detect this and will skip the incomplete records.

eLevelDB Database Files

Below are two directory listings showing what you would expect to find on disk when using eLevelDB. In this example we use a 64-partition ring, which results in 64 separate directories, each with its own LevelDB database. The first listing shows a cluster before any significant write activity.

leveldb/
|-- 0
|   |-- 000003.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   `-- MANIFEST-000002
|-- 1004782375664995756265033322492444576013453623296
|   |-- 000005.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   |-- LOG.old
|   `-- MANIFEST-000004
|-- 1027618338748291114361965898003636498195577569280
|   |-- 000005.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   |-- LOG.old
|   `-- MANIFEST-000004

... etc ...

`-- 981946412581700398168100746981252653831329677312
    |-- 000005.log
    |-- CURRENT
    |-- LOCK
    |-- LOG
    |-- LOG.old
    `-- MANIFEST-000004

64 directories, 378 files

After performing a large number of “put” (write) operations, the Riak cluster running eLevelDB will look something like this:

$ tree leveldb/
leveldb/
|-- 0
|   |-- 000118.sst
|   |-- 000119.sst
|   |-- 000120.sst
|   |-- 000121.sst
|   |-- 000123.sst
|   |-- 000126.sst
|   |-- 000127.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   |-- LOG.old
|   `-- MANIFEST-000125
|-- 1004782375664995756265033322492444576013453623296
|   |-- 000120.sst
|   |-- 000121.sst
|   |-- 000122.sst
|   |-- 000123.sst
|   |-- 000125.sst
|   |-- 000128.sst
|   |-- 000129.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   |-- LOG.old
|   `-- MANIFEST-000127
|-- 1027618338748291114361965898003636498195577569280
|   |-- 000003.log
|   |-- CURRENT
|   |-- LOCK
|   |-- LOG
|   `-- MANIFEST-000002

... etc ...

`-- 981946412581700398168100746981252653831329677312
    |-- 000003.log
    |-- CURRENT
    |-- LOCK
    |-- LOG
    `-- MANIFEST-000002

64 directories, 433 files