21 January 2011

Compression at "Internet" scale (originally posted to StorageMonkeys November 22, 2009)

One of the things I've learned, having worked in both traditional "Enterprise" environments and "Internet" companies, is that the latter face storage-scale problems one or two orders of magnitude larger than the former.

Fortunately, there's also a difference in the nature of the data, such that the most voluminous (and, arguably, most valuable) data, web access logs, are highly compressible (5-20x) with the right algorithm. Compression is important at this scale for reducing I/O and increasing the speed of access, not for reducing the number of bits "spinning" on platters.
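As a rough sanity check on how compressible a given log actually is, a quick sketch like the following (Python standard library only; "access.log.sample" is a placeholder file name, and the ratios will vary with your traffic mix) is usually enough to see whether you're in the 5x or the 20x regime:

    # Compare compression ratios of a web-log sample using stdlib codecs.
    # "access.log.sample" is a placeholder; ratios vary with the traffic mix.
    import bz2, lzma, zlib

    with open("access.log.sample", "rb") as f:
        data = f.read()

    for name, compress in [
        ("zlib (Deflate)", lambda d: zlib.compress(d, 9)),
        ("bz2 (BWT)", lambda d: bz2.compress(d, 9)),
        ("lzma (LZMA)", lambda d: lzma.compress(d, preset=9)),
    ]:
        print(f"{name}: {len(data) / len(compress(data)):.1f}x")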

A solution must work in real time. There is some flexibility, in that average load is rarely anywhere near peak load. However, my experience is that paying for unused capacity is better than depending on catching up on backlogs during off-peak times. In the former case, the consequences of a poor estimate are finite and predictable; in the latter, they are not.

Assuming one wants to actually use the data, a solution must decompress at least as fast as it compresses. I haven't run into this as a problem, since the readily available algorithms easily meet the requirement. A possible issue is parallel compression paired with centralized decompression, such as loading into a decision-support database.
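Checking a candidate against that requirement takes only a minimal timing sketch; the one below uses the standard-library LZMA bindings as the example codec and a placeholder file name, so substitute whatever you're actually evaluating:

    # Time compression vs. decompression throughput for one codec.
    import lzma, time

    with open("access.log.sample", "rb") as f:
        data = f.read()

    t0 = time.perf_counter()
    packed = lzma.compress(data, preset=6)
    t1 = time.perf_counter()
    lzma.decompress(packed)
    t2 = time.perf_counter()

    mb = len(data) / 1e6
    print(f"compress:   {mb / (t1 - t0):.1f} MB/s")
    print(f"decompress: {mb / (t2 - t1):.1f} MB/s")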

Performance has to be no more than O(n) for memory (distributed compressors) or O(n) for CPU (central compressor). Fortunately, the former appears easily satisfied by available algorithms, so long as "n" is log event volume, not the average size of each log event.
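In practice, a streaming compressor keeps a fixed working set per node no matter how much data flows through it, since it can be fed the log in fixed-size chunks while CPU time grows with volume. A minimal sketch of that, assuming the standard-library LZMA bindings and placeholder file names:

    # Feed the compressor fixed-size chunks so memory stays bounded by the
    # codec's settings while CPU scales with total log volume.
    import lzma

    CHUNK = 1 << 20  # 1 MB reads; independent of total log size

    compressor = lzma.LZMACompressor(preset=6)
    with open("access.log", "rb") as src, open("access.log.xz", "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(compressor.compress(chunk))
        dst.write(compressor.flush())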

HTTP logs are extremely self-similar, so just throwing Lempel-Ziv at them is sub-optimal. Experimenting, I've found that descendants like LZMA do quite well (around 5x), but that seems to be the top end, at a not particularly impressive speed. This may be great for general-purpose compression but not for this special purpose.

Though web logs often have plenty of natural language embedded within them, large-text compressors (such as those tuned for the Hutter Prize, cf. http://mattmahoney.net/dc/text.html) aren't ideal either. I speculate that this is due to a much higher incidence of abbreviations and numerals, but I'm hardly qualified to say.

Another possibility would be to configure or customize one's web server to log in a pre-compressed format. I generally reject this out of hand, because it removes much of the self-documenting nature of verbose logs. Moreover, the server can't predict the future to determine the frequency of the current log event; doing so would mean maintaining a buffer, which may as well be on the disk of another server, which is the current situation. Perhaps more to the point, my operational philosophy discourages burdening something critical, like a web server, with something ancillary, like log compression.

The best option I've found so far is the PPMd algorithm, primarily as implemented in the 7zip package. Specifically, with order 7 and 1GB of memory, a modern CPU will compress my web logs 10:1 at 10MB/s. Its main disadvantages are being memory-heavy, with an identical footprint for compression and decompression, and the lack of a parallel implementation.
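For reference, a sketch of how one might invoke it from a log-rotation script; the method switch (-m0=PPMd:o=7:mem=1g) follows the 7-Zip manual's syntax for model order and memory, and the file names are placeholders, so verify both against your own 7-Zip version:

    # Sketch: invoke 7-Zip's PPMd with order 7 and 1GB of model memory.
    # Switch syntax per the 7-Zip manual; file names are placeholders.
    import subprocess

    subprocess.run(
        ["7z", "a", "-m0=PPMd:o=7:mem=1g", "access.log.7z", "access.log"],
        check=True,
    )

Remember that decompressing the result will need the same 1GB of memory, which is part of why the memory footprint counts as a disadvantage.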

I don't yet have any good data, partly because the fast pace of startups means the character of the logs I work with keeps changing, and partly because I rarely have the luxury of trying more than one method on the same data. Once I do, I'll post some hard-number comparisons between LZMA and PPMd with various tuning options.

Next year, look for my musings on compression of database redo/write-ahead logs.
