To summarize the above: create a CF where each row corresponds to a date range of data for a single metric. You’ll often see people suggest making each bucket a day’s worth of data. If you’re storing 5 minute averages, that’s 288 columns/row.
Here’s an example CF for storing network interface counter statistics:
create column family InterfaceStatDaily with column_type = Standard and key_validation_class = AsciiType and comparator = LongType and default_validation_class = LongType and comment = 'Storage for 5min interface stats';
So I create rows like:
InterfaceStatDaily['myhost|eth0|ifInOctets|20121001'] = [ [ 1349049600, 19245 ], [ 1349049900, 23475 ], [ 1349050200, 173445 ], ... ]
This creates a row for the ifInOctets counter on the network interface eth0 of myhost for Oct 01, 2012. The first value (at 00:00:00) is 19245, the second value (at 00:05:00) is 23475, etc. Notice I’m storing column names as LongTypes using Epoch time; this is much more disk space efficient than Time-UUIDs.
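To make the key layout concrete, here is a minimal Python sketch of how the row key and column name could be derived from a sample’s timestamp. The helper names (daily_row_key, column_name) are made up for illustration, and it assumes day buckets are cut at UTC midnight, which matches the 1349049600 example above.

```python
from datetime import datetime, timezone

INTERVAL = 300  # seconds between samples (5 minutes)

def daily_row_key(host, interface, metric, epoch):
    """Build a key like 'myhost|eth0|ifInOctets|20121001' (UTC day bucket)."""
    day = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y%m%d")
    return "|".join([host, interface, metric, day])

def column_name(epoch):
    """Round an epoch timestamp down to the start of its 5 minute slot."""
    return epoch - (epoch % INTERVAL)

# A poll that lands 17 seconds after 00:05:00 still goes in the 00:05:00 column.
key = daily_row_key("myhost", "eth0", "ifInOctets", 1349049917)
col = column_name(1349049917)
print(key, col)
```

Rounding the column name down means a slightly late poll overwrites (rather than duplicates) the slot it belongs to.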
In my case, according to this calculator a column with a LongType name & value would be 31 bytes. However, we’re really only storing 12 bytes of data (4 bytes epoch + 8 byte value), so the vast majority of our disk space is overhead. A year’s worth of data would be ~1.173TB (12 * 31 * 60 * 24 / 5 * 365 * 30000) just for the 5min values, assuming 30K ports * 12 stats/port. That doesn’t include replication (at RF=3 you’re talking roughly 3.5TB/year!), row/sstable overhead for indexes or bloom filters, or creating aggregates. Even for a database like Cassandra, which scales really well, this is stupid, especially if you’re trying to keep your nodes down to the 300-400GB recommended limit.
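If you want to plug in your own numbers, the arithmetic above is easy to script. This is just a back-of-the-envelope sketch using the figures from this post (31 bytes/column, 12 stats on each of 30K ports); the variable names are mine.

```python
BYTES_PER_COLUMN = 31            # LongType name + LongType value, per the calculator
SAMPLES_PER_DAY = 60 * 24 // 5   # 288 five-minute samples per day
STATS_PER_PORT = 12
PORTS = 30000
DAYS = 365

columns_per_year = SAMPLES_PER_DAY * DAYS * STATS_PER_PORT * PORTS
raw_bytes = BYTES_PER_COLUMN * columns_per_year
payload_bytes = 12 * columns_per_year   # 4 byte epoch + 8 byte value

print("%.3f TB/year before replication" % (raw_bytes / 1e12))
print("%.3f TB/year at RF=3" % (3 * raw_bytes / 1e12))
print("%.0f%% of that is actual data" % (100.0 * payload_bytes / raw_bytes))
```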
Using Vector Compression
The solution is to add a second Column Family:
create column family InterfaceStatCompressed with column_type = Standard and key_validation_class = AsciiType and comparator = LongType and default_validation_class = BytesType and comment = 'Vector compressed storage for 5min interface stats';
And then write your rows like this:
InterfaceStatCompressed['myhost|eth0|ifInOctets|2012'] = [ [ 1349049600, Vector[19245, 23475, 173445, ...] ], [ 1349136000, Vector[823445, 1366, 2343461, ...] ], ... ]
First you’ll notice the row key is slightly different: instead of a day, the last part of the row key is the year. The column name is the start of each day in Epoch time. Next, rather than storing a single value in a given column, we store a vector of a day’s worth of data as a byte stream of 2,304 bytes (8 bytes/value * 288 values = 2,304).
Note: I wrote the above as Vector[xxx,xxx,xxx,…], but what that really means is converting all your values to 64-bit integers, concatenating them together and writing them as a single value like:
Which is the actual bytes for the first 3 values for the Vector[19245, 23475, 173445, …].
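As a rough illustration of that packing step, here is a minimal Python sketch using the standard struct module. The byte order is an assumption on my part (big-endian, unsigned 64-bit); use whatever your reader expects, as long as both sides agree. The function names are hypothetical.

```python
import struct

def pack_vector(values):
    """Concatenate values as unsigned 64-bit big-endian integers (assumed layout)."""
    return struct.pack(">%dQ" % len(values), *values)

def unpack_vector(blob):
    """Inverse of pack_vector: split the blob back into a list of integers."""
    return list(struct.unpack(">%dQ" % (len(blob) // 8), blob))

blob = pack_vector([19245, 23475, 173445])
print(blob.hex())            # 24 bytes, 8 per value
print(unpack_vector(blob))   # round-trips to the original values
```

A full day of 288 values packed this way is exactly 2,304 bytes, whatever the values are.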
Note: Our first idea was to write the data as an AsciiType of comma separated array of values. But it turns out this is less efficient than the above solution unless your data is very sparse or consists of small values, since a comma separated array of 32-bit hexadecimal values takes up to 2,591 bytes.
The advantage here is that the amount of overhead you’re dealing with is greatly reduced, because now there are only 365 columns/year instead of 105,120. Each column takes 2,327 bytes (15 + 8 + 2,304) but stores a full day’s worth of 5min values (288). Storing a year’s worth of data this way comes to only ~305GB/year (before replication, etc.), a savings of 74%. That works out to ~850,000 bytes per stat per year, or ~10MB per interface per year. Even better news: column families like this are highly compressible, so you’ll get even better results if you enable compression!
If you have sparse data, I use a ‘magic’ value of 2**64-2 as a placeholder for each missing individual value. If for some reason the entire day’s worth of data is missing (a common occurrence for me), I store the single value 2**64-1 rather than a vector of 288 copies of 2**64-2.
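The savings figure can be checked with a few lines of Python. This sketch reuses the numbers from this post: 31 bytes for a one-value column, and 15 + 8 bytes of column overhead and name in front of the 2,304 byte vector. The variable names are mine.

```python
SAMPLES_PER_DAY = 288
SERIES = 12 * 30000     # 12 stats on each of 30K ports
DAYS = 365

per_value_column = 31                            # one column per 5min sample
per_day_column = 15 + 8 + 8 * SAMPLES_PER_DAY    # overhead + name + packed vector

uncompressed = per_value_column * SAMPLES_PER_DAY * DAYS * SERIES
compressed = per_day_column * DAYS * SERIES
savings = 1.0 - float(compressed) / uncompressed

print("%.1f GB/year as vectors" % (compressed / 1e9))
print("%.0f%% smaller than one column per sample" % (100 * savings))
print("%d bytes per stat per year" % (per_day_column * DAYS))
```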
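Here is a small sketch of how those two magic values could be handled when encoding and decoding a day. It assumes unsigned 64-bit big-endian packing and uses None in Python to mean "no sample"; encode_day and decode_day are hypothetical helper names.

```python
import struct

SLOTS = 288
MISSING = 2**64 - 2      # placeholder for one missing sample
EMPTY_DAY = 2**64 - 1    # the whole day has no data

def encode_day(samples):
    """samples: list of SLOTS entries, each an int or None (missing)."""
    if all(v is None for v in samples):
        return struct.pack(">Q", EMPTY_DAY)          # single 8 byte marker
    filled = [MISSING if v is None else v for v in samples]
    return struct.pack(">%dQ" % len(filled), *filled)

def decode_day(blob):
    """Return a list of ints/None, expanding the empty-day marker."""
    values = struct.unpack(">%dQ" % (len(blob) // 8), blob)
    if values == (EMPTY_DAY,):
        return [None] * SLOTS
    return [None if v == MISSING else v for v in values]

day = [None] * SLOTS
day[0], day[1] = 19245, 23475
assert decode_day(encode_day(day)) == day
```

Note that both magic values only fit in an unsigned 64-bit integer, so real counter values must stay below 2**64-2.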
Bringing It All Together
The downside of the InterfaceStatCompressed Column Family is that you have to read/write a full day’s worth of data at a time. That is usually fine for reads, since most applications like graphing require reading multiple datapoints anyway. But for writing the current values as you poll them, having to constantly read the current day’s worth of data, add the new datapoint and write the new column would kill our write performance.
The solution and best balance then is to write the current values to InterfaceStatDaily as they come in, and then once a day read the previous day’s complete row and write it to InterfaceStatCompressed and then delete the original row. That gives you the best of both worlds: high read and write throughput and reasonable disk space usage.
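The daily job itself is simple. The sketch below uses plain Python dicts as stand-ins for the two column families so it runs without a cluster; in real code the dict reads and writes would be whatever your client library (Pycassa, Hector, etc.) provides. The function name, the big-endian packing and the UTC day boundary are all assumptions of this sketch, and it skips the single-value marker for completely empty days to stay short.

```python
import struct
import time

INTERVAL, SLOTS = 300, 288
MISSING = 2**64 - 2   # per-sample placeholder (the 'magic' value)

def compress_day(daily_cf, compressed_cf, series, day_start):
    """Move one finished day from the per-sample layout to the vector layout.

    daily_cf / compressed_cf: dicts of {row_key: {column_name: value}}.
    series: 'host|iface|metric'; day_start: midnight UTC in epoch seconds.
    """
    t = time.gmtime(day_start)
    daily_key = "%s|%s" % (series, time.strftime("%Y%m%d", t))
    yearly_key = "%s|%s" % (series, time.strftime("%Y", t))

    row = daily_cf.get(daily_key, {})
    slots = [row.get(day_start + i * INTERVAL, MISSING) for i in range(SLOTS)]
    blob = struct.pack(">%dQ" % SLOTS, *slots)

    compressed_cf.setdefault(yearly_key, {})[day_start] = blob  # write first...
    daily_cf.pop(daily_key, None)                               # ...then delete

daily = {"myhost|eth0|ifInOctets|20121001": {1349049600: 19245, 1349049900: 23475}}
compressed = {}
compress_day(daily, compressed, "myhost|eth0|ifInOctets", 1349049600)
```

Writing the compressed column before deleting the daily row means a crash in between leaves you with duplicate data rather than lost data.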
Of course, the downside of this is that now you have your data in two different CFs with different layouts. Hence you’ll need to wrap whatever API access layer you’re already using (Hector, Pycassa, etc) to intelligently query the two CFs based on whatever date range you’re querying and return the values in a single format. However, considering the significant disk space savings, I found this to be well worth doing!
Workarounds for Time-UUID
Note: Often you’ll see recommendations to use Time-UUIDs because, if there is a chance of two values getting the same timestamp, one write will overwrite the other. This can happen for a click stream (two users clicking on the same link at the same time), but won’t happen for things like interface stats. If your timestamps can’t be written at some kind of consistent interval (like 5 minutes in this example) then the solution above probably won’t work for you.
OK, so what if you need to write your data with Time-UUIDs, or your timestamps aren’t written at a specific interval? You can still use this solution, but with a few changes:
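To give an idea of what such a wrapper does, here is a minimal sketch that merges both layouts into one list of (timestamp, value) pairs. As before, plain dicts stand in for the two CFs and the fetch function is hypothetical; it assumes big-endian 64-bit packing, UTC day boundaries, and that a single 8 byte value marks an empty day.

```python
import struct
import time

INTERVAL, DAY = 300, 86400
MISSING = 2**64 - 2   # per-sample placeholder

def fetch(daily_cf, compressed_cf, series, start, end):
    """Return sorted (epoch, value) pairs in [start, end) from both layouts."""
    points = {}
    day = start - start % DAY
    while day < end:
        t = time.gmtime(day)
        yearly_row = compressed_cf.get("%s|%s" % (series, time.strftime("%Y", t)), {})
        blob = yearly_row.get(day)
        if blob is not None and len(blob) > 8:   # 8 bytes alone = empty-day marker
            values = struct.unpack(">%dQ" % (len(blob) // 8), blob)
            for i, v in enumerate(values):
                if v != MISSING:
                    points[day + i * INTERVAL] = v
        # Days not yet compressed still live in the per-sample CF.
        points.update(daily_cf.get("%s|%s" % (series, time.strftime("%Y%m%d", t)), {}))
        day += DAY
    return sorted((ts, v) for ts, v in points.items() if start <= ts < end)

series = "myhost|eth0|ifInOctets"
compressed = {series + "|2012": {1349049600: struct.pack(">288Q", *([19245, 23475] + [MISSING] * 286))}}
daily = {series + "|20121002": {1349136000: 823445}}
result = fetch(daily, compressed, series, 1349049600, 1349136300)
```

The caller never needs to know which CF a given day came from, which is the whole point of the wrapper.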
- Specify InterfaceStatDaily to use TimeUUID as your column comparator
- When writing InterfaceStatCompressed sum() all the values for each 5 minute interval and write that at set intervals.
Obviously, doing this you will lose sub-5 minute resolution (or whatever time interval you choose) in the compressed data, but you won’t be storing all those extra timestamps. If your needs require high precision timestamps for recent data and less precision for historical data, then this may be a great workaround.
If you don’t need high precision for current data, then one way to limit the differences in your CFs and the wrapping APIs is to use Epoch column names with Counters for InterfaceStatDaily. That way every time someone clicks on a URL (or whatever you’re counting) in a 5 minute time period, you just increment the appropriate counter. Note that counters may not work well for every use case, so be sure to research their limitations before using them.
The key thing is that when you run your daily vector compression job, you want to match your time interval resolution to how often you are actually getting data. I.e., if you’re getting writes approximately every 60 seconds, don’t use 5 second resolution, or the vast majority of your storage space will be the 2**64-2 placeholders indicating no data!
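The summing step might look something like the sketch below. It assumes you have already turned each Time-UUID column name into epoch seconds (possibly fractional); sum_into_buckets is a hypothetical helper name.

```python
from collections import defaultdict

INTERVAL = 300  # bucket width in seconds

def sum_into_buckets(events, interval=INTERVAL):
    """events: iterable of (epoch_seconds, value) at arbitrary times.
    Returns {bucket_start: sum of the values that fall in that bucket}."""
    buckets = defaultdict(int)
    for ts, value in events:
        whole = int(ts)
        buckets[whole - whole % interval] += value
    return dict(buckets)

clicks = [(1349049601.2, 1), (1349049750, 1), (1349049900, 1)]
print(sum_into_buckets(clicks))
```

The resulting dict has at most one entry per interval, so it can go through the same daily vector compression as the fixed-interval data.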
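A quick way to sanity check a resolution before committing to it is to estimate how many slots would end up as placeholders. This sketch assumes writes arrive at a roughly even rate and never more than one per slot; the function name is made up.

```python
def placeholder_fraction(write_period, resolution):
    """Fraction of slots left as placeholders when one write arrives every
    write_period seconds and each slot is resolution seconds wide."""
    if resolution >= write_period:
        return 0.0   # every slot gets at least one write
    return 1.0 - float(resolution) / write_period

# Writes every 60 seconds stored at 5 second resolution: 11 of 12 slots are empty.
print("%.1f%% placeholders" % (100 * placeholder_fraction(60, 5)))
print("%.1f%% placeholders" % (100 * placeholder_fraction(60, 60)))
```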