Cassandra lowers the barriers to big data

An InfoWorld Test Center overview of the nuts and bolts of Cassandra: A free, open source NoSQL datastore that supports rapid and massive growth

By Rick Grehan

Apache Cassandra is a free, open source NoSQL database designed to manage very large data sets (think petabytes) across large clusters of commodity servers. Among many distinguishing features, Cassandra excels at scaling writes as well as reads, and its “master-less” architecture makes creating and expanding clusters relatively straightforward. For organizations seeking a data store that can support rapid and massive growth, Cassandra should be high on the list of options to consider.

Cassandra comes from an auspicious lineage. It was influenced not only by Google’s Bigtable, from which it inherits its data architecture, but also Amazon’s Dynamo, from which it borrows its distribution mechanisms. Like Dynamo, nodes in a Cassandra cluster are completely symmetrical, all having identical responsibilities. Cassandra also employs Dynamo-style consistent hashing to partition and replicate data. (Dynamo is Amazon’s highly available key-value storage system, on which DynamoDB is based.)

How Cassandra works

Cassandra’s impressive hierarchy of caching mechanisms and carefully orchestrated disk I/O ensures speed and data safety. Its storage architecture is similar to a log-structured merge tree: Write operations are sent first to a persistent commit log (ensuring a durable write), then to a write-back cache called a memtable. When the memtable fills, it is flushed to an SSTable (sorted string table) on disk. All disk writes are appends — large sequential writes, not random writes — and therefore very efficient. Periodically, the SSTable files are merged and compacted.

A Cassandra cluster is organized as a ring, and it uses a partitioning strategy to distribute data evenly. The preferred partitioner is the RandomPartitioner, which generates a 128-bit consistent hash to determine data placement. The partitioner is assisted by another component called a “snitch,” which maps between a node’s IP address and its physical location in a rack or data center.

When Cassandra writes data, that data is written to multiple nodes so that it remains available in the event of node failure. The nodes to which a given data element is written are called “replica nodes.” Cassandra uses the snitch to ensure that the replica nodes for any particular piece of information are not in the same rack. Otherwise, if the rack were to fail, the data element and all its replica copies would be lost.

Should one or more nodes in a cluster become overutilized, Cassandra rebalances the cluster with the aid of “virtual nodes,” or vnodes. An entirely logical construct, a vnode is essentially a container for a range of database rows. Because each physical node is assigned multiple vnodes, Cassandra can rebalance simply by moving a virtual node from an overloaded cluster member to less burdened members. Using virtual nodes makes load balancing more efficient because it allows Cassandra to rebalance by moving small amounts of data from multiple sources to a destination.

To maintain write throughput in the face of node failures, Cassandra uses “hinted handoffs.” A node receiving a write request will attempt to deliver the request to the replica node responsible for the data. If that fails, the recipient node (referred to as the “coordinator node”) will save the request as a “hint” — a reminder to replay the write operation when the unreachable replica node becomes available. If the coordinator node knows beforehand that the replica node is unreachable, the hint is saved immediately.

Hinted handoffs are one of Cassandra’s consistency repair features. Another, called “read repair,” comes into play during read request processing. Depending on the consistency level chosen (explained below), Cassandra may satisfy a read request by reading only one of the replica nodes. Even so, it will issue background reads to all the replica nodes, and verify that all have the latest version of the data. Those that don’t are sent write operations to ensure that all copies of the data are up-to-date and consistent.

Consistency or speed

A prominent benefit of an RDBMS is its adherence to ACID — atomicity, consistency, isolation, and durability — principles, which guarantees repeatable, deterministic behavior in a multiclient setting and helps ensure data safety in spite of system failure. Nonrelational databases like Cassandra eschew ACID guarantees on the basis that they become performance-limiting as the database scales in both quantity of data and I/O requests.

Cassandra is described as being “eventually consistent.” When data is written to Cassandra, that data is not necessarily written simultaneously on all replica nodes. As described earlier, some cluster members might be temporarily unreachable. However, hinted handoffs ensure all nodes eventually catch up, and the system becomes consistent. Similarly, read repairs catch and correct inconsistencies when the data moves in the other direction, from Cassandra to the outside world.

This notion that different nodes in a cluster might possess inconsistent copies of a given data element might make you uneasy. The good news is that you can tune Cassandra’s consistency level. For instance, you can control the level of consistency that a write operation has achieved — how many replica nodes have written the data — before the write is acknowledged as successful to the issuing client application.

Similarly, on read operations, you can control how many replica nodes have responded before the response is returned to the client. This tunable consistency level ranges from Any, which means the request completes if any node responds, to All, which means the request only completes if all replica nodes have responded. Midway between Any and All are consistency levels such as Quorum, which allows requests to complete if a majority of replica nodes have responded. Cassandra’s tunable consistency is a powerful feature that lets you balance speed and consistency or trade one for the other. Want speed? Pick Any. Want full consistency? Pick All.

Because Cassandra is distributed, a cluster’s members require a mechanism for discovering one another and communicating state information. This is where Cassandra’s Gossip protocol comes in. As you might suspect, Gossip gets its name from the human activity of passing information throughout a group via apparently random, person-to-person conversations.

Certain nodes in a cluster are designated as “seed” nodes. Each second, a timer on a Cassandra node fires, initiating communication with two or three randomly selected nodes in the cluster, one of which must be a seed node. Consequently, seed nodes will tend to have the most up-to-date view of a cluster. (When a new node is added to a cluster, it first contacts a seed node.)

Cassandra works to keep Gossip communication efficient. Each node maintains two sorts of states. HeartBeatState tracks the node’s version number, which is incremented any time information on the node has changed, and how often the node was restarted. ApplicationState tracks the operational state of the node (such as the current load). Nodes exchange digests of HeartBeatState information with one another. If differences are found, the nodes then exchange digests of ApplicationState info, and ultimately the ApplicationState data itself. In addition, the Gossip algorithm first seeks to resolve differences that are “farther apart” (in terms of version numbers), since those are more likely to embody the widest inconsistencies.

Working with Cassandra

RDBMS users familiar with SQL should feel right at home with CQL, the Cassandra Query Language, which can be executed from the Python-based Cassandra shell utility (cqlsh) or through any of several client drivers. Client drivers are available from websites like Planet Cassandra, where you’ll find CQL-enabled drivers for Java, C#, Node.js, PHP, and others.

In the past, drivers communicated with a Cassandra cluster using a Thrift API — Thrift being a framework for creating what amounts to language-independent remote procedure calls for client and server. Cassandra’s Thrift API is now considered a legacy feature, as the CQL specification defines not only the CQL language, but an on-the-wire communication protocol as well.

CQL’s syntax resembles its relational cousin’s. It has SELECT, INSERT, UPDATE, and DELETE statements, and these are accompanied by FROM and WHERE clauses. In addition, CQL’s data types are what you would expect. You’ll find integers, floats and doubles, blobs, and more. Of course, there are differences. For one, CQL has no JOIN operation. And when you write a FROM clause, you specify column families — though, as of the latest version of CQL, the term “table” is used in place of “column family.” CQL also lets you specify the desired consistency level for any operation, but its real benefit is that it is a data management language quickly grasped by relational programmers, and is independent of a specific programming API.

Installing Cassandra is reasonably straightforward, particularly if you download the DataStax Community edition, which bundles a Web-based management application called OpsCenter. I downloaded and installed the tarball version of Cassandra on my Ubuntu Linux system (the apt-get version for some reason refused to install) and found that the real work lies in configuring a Cassandra cluster. The configuration.yaml file holds scads of tunable parameters for the node and its cluster.

For example, you can set the number of tokens that will be assigned to the node, which controls the proportion of data (relative to other nodes) that the node will be responsible for. (This is useful if your cluster is composed of heterogeneous hardware because more powerful members can be configured to handle heavier loads.) Happily, for a small trial installation, you need only configure the listening IP address for the current node and the IP addresses of the cluster’s seed nodes.

OpsCenter runs a server process on your management host that communicates with agent processes executing on the cluster’s nodes. The agents gather usage and performance information and send it to the server, which provides a browser-based user interface for viewing the aggregated results. With OpsCenter, you can browse data, examine throughput graphs, manage column families, initiate cluster rebalancing, and so on. (As an aside, I was unable to get OpsCenter working successfully on my Linux installation. The DataStax Community Edition installation on Windows worked, but only partially, it being unable to connect to the agent service.)

While documentation — primarily in the form of FAQs, wikis, and blogs — exists on the Apache Cassandra site and the Planet Cassandra site, DataStax is the most comprehensive source for Cassandra documentation and tutorials. In fact, Planet Cassandra’s Getting Started page more or less points you to the DataStax pages.

DataStax maintains documentation of both current and previous versions; as Cassandra is updated, you can troubleshoot any earlier installations you continue to run. The Web pages are well hyperlinked and provide plenty of diagrams. Along with video tutorials, you’ll also find reference guides for Java and C# drivers, as well as developer blogs on Cassandra internals.

Until recently, Cassandra provided no transactional capabilities. However, the latest release of Cassandra (version 2.0) adds “lightweight transactions” that employ an atomic “compare and set” architecture. In CQL, this is manifested as a conditional IF clause on INSERT and UPDATE commands. The data is modified if a particular condition is true. You can imagine a CQL INSERT statement that will only add a new row if the row does not exist, and the presence of the transactional IF test will guarantee that the INSERT is atomic for the database.

Cassandra 2.0 also improves response performance with “eager retries.” If a given replica is slow to respond to a read request, Cassandra will send that request to other replicas if there’s a chance the other replicas might respond prior to the request timeout. With version 2.0, Cassandra now handles the removal of stale index entries “lazily.” In the past, stale entries were cleaned up immediately, which required a synchronization lock. The new technique avoids the throughput-constricting lock.

While Cassandra is a complicated system, its symmetrical treatment of cluster nodes makes it surprisingly easy to get up and running. The SQL-like nature of CQL is a great benefit, making it quicker and easier for developers moving from RDBMS environments to become productive.

Nevertheless, the learning curve for Cassandra is significant. It’s a good idea to set up a small to modest development cluster and do plenty of experimenting, particularly with your data schema and configuration parameters. Performance issues can become significant as the application scales up.

This story, “Cassandra lowers the barriers to big data” was originally published by InfoWorld.