One of the things I've learned, having worked in both traditional "Enterprise" environments and "Internet" companies, is that the latter face storage scale issues an order of magnitude or two larger than the former.
Fortunately, there's also a difference in the nature of the data, such
that the most voluminous (and, arguably, most valuable) data, web access
logs, are highly compressible (5-20x) with the right algorithm.
Compression is important at this scale for reducing I/O and increasing
speed of access, not for reducing the number of bits "spinning" on platters.
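To make the compressibility claim concrete, here's a quick sketch using Python's built-in lzma module on synthetic access-log lines; the log format and the resulting ratio are illustrative, not measurements from production logs.

```python
# Sketch: measure the compressibility of synthetic web access logs.
# The log format and ratios here are illustrative, not production data.
import lzma
import random

random.seed(42)
paths = ["/index.html", "/about", "/api/v1/items", "/static/app.js"]
lines = []
for i in range(20000):
    lines.append(
        '10.0.%d.%d - - [01/Jan/2024:12:%02d:%02d +0000] '
        '"GET %s HTTP/1.1" 200 %d "-" "Mozilla/5.0"\n'
        % (random.randrange(256), random.randrange(256),
           i // 60 % 60, i % 60, random.choice(paths),
           random.randrange(200, 50000))
    )
raw = "".join(lines).encode()
compressed = lzma.compress(raw)
ratio = len(raw) / len(compressed)
print("ratio: %.1fx" % ratio)
```

Because only the IPs, paths, and byte counts vary, the shared structure dominates, and even a general-purpose compressor lands comfortably in the range quoted above.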
Any solution must work in real time. There is some flexibility in that
average load is rarely anywhere near peak load. However, my experience
is that paying for unused capacity is better than depending on
catching up on backlogs during off-peak times. In the former case, the
consequences of a poor estimate are finite and predictable, but not so in
the latter.
Assuming one wants to use the data, a solution must
decompress at least as fast as it compresses. I haven't run into this
as a problem, since the readily available algorithms easily meet such a
requirement. A possible issue could be with parallel processing of the
compression but centralized processing of the decompression, such as
loading into a decision support database.
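If compression happens in parallel on many front-end hosts while decompression happens in one central place, the streams need to be independently decodable. Here's a minimal sketch of that shape, again with Python's lzma; the chunk size, helper name, and log contents are made up for illustration:

```python
# Sketch: compress chunks independently (as distributed workers would),
# then decompress them centrally. Chunk boundaries fall on newlines so
# each chunk is a self-contained, valid standalone stream.
import lzma
from concurrent.futures import ThreadPoolExecutor

def split_on_lines(data: bytes, target_size: int):
    """Yield chunks of roughly target_size bytes, each ending on a newline."""
    start = 0
    while start < len(data):
        end = data.find(b"\n", start + target_size)
        if end == -1:
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1

log = b"".join(b'host%d - - "GET /item/%d HTTP/1.1" 200\n' % (i % 7, i)
               for i in range(5000))

# "Distributed" producers: each chunk is compressed on its own.
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(lzma.compress, split_on_lines(log, 16384)))

# Central consumer: decompress the frames in order and reassemble.
restored = b"".join(lzma.decompress(f) for f in frames)
assert restored == log
```

The cost of chunking is a slightly worse ratio (each chunk resets the model), which is the usual trade against being able to fan the work out.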
Performance has to be no
more than O(n) for memory (distributed compressors) or O(n) for CPU
(central compressor). Fortunately, the former appears easily satisfied
by available algorithms, so long as "n" is log event volume, not average
size of each log event.
HTTP logs are extremely self-similar, so
just throwing Lempel-Ziv at them is sub-optimal. Experimenting, I've
found that descendants like LZMA do quite well (around 5x), but that seems
to be the top end, at a not particularly impressive speed. This may be
great for general purpose compression but not for this special purpose.
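For a rough feel of that gap, here's a sketch pitting zlib (plain DEFLATE-style Lempel-Ziv) against LZMA on the same synthetic lines; the data is artificial, so treat the ratios as illustrative only:

```python
# Sketch: classic LZ (zlib/DEFLATE) vs. an LZ descendant (LZMA) on
# synthetic, highly self-similar log lines. Artificial data; the exact
# ratios are illustrative, not a benchmark.
import lzma
import zlib

log = b"".join(
    b'203.0.113.%d - - [01/Jan/2024:00:00:%02d +0000] '
    b'"GET /page/%d HTTP/1.1" 200 %d\n'
    % (i % 254, i % 60, i % 100, 1000 + (i * 37) % 9000)
    for i in range(20000)
)
zlib_ratio = len(log) / len(zlib.compress(log, 9))
lzma_ratio = len(log) / len(lzma.compress(log))
print("zlib: %.1fx  lzma: %.1fx" % (zlib_ratio, lzma_ratio))
```

LZMA's much larger dictionary lets it match repeats that fall outside DEFLATE's 32KB window, which is one reason it pulls ahead on logs whose self-similarity spans megabytes.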
Although they'll often have plenty of natural language embedded within, large
text compressors (such as those tuned for the Hutter Prize cf. http://mattmahoney.net/dc/text.html)
aren't ideal, either. I speculate that this is due to a much higher
incidence of abbreviations and numerals, but I'm hardly qualified.
Another possibility would be to configure or customize one's web server to log in a
pre-compressed format. I generally reject this out of hand, because it
removes much of the self-documenting nature of verbose logs. Moreover,
it can't predict the future to determine the frequency of a current log
event. To do so would mean maintaining a buffer, which may as well be on
the disk of another server, the current situation. Perhaps more to the
point, my operational philosophy discourages burdening something
critical like a web server with something ancillary like log compression.
The best option I've found so far is the PPMd
algorithm, primarily as implemented in software by the 7zip package.
Specifically, with order 7 and 1GB of memory, a modern CPU will compress
my web logs 10:1 at 10MB/s. Its main disadvantages are being memory
heavy, with an identical footprint for compression and decompression, and
the lack of a parallel implementation.
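For reference, here's a sketch of driving that configuration from Python. The method string mirrors the order-7 / 1GB settings above; the filenames are placeholders, and it assumes a `7z` binary is on the PATH (the run is skipped when it isn't):

```python
# Sketch: invoking 7-Zip's PPMd implementation with the settings
# discussed above (model order 7, ~1GB of model memory). Filenames are
# placeholders; requires the `7z` binary to actually run.
import os
import shutil
import subprocess
import tempfile

def ppmd_command(archive: str, logfile: str, order: int = 7,
                 mem: str = "1024m") -> list[str]:
    """Build a 7z command line that compresses with PPMd."""
    return ["7z", "a", "-m0=PPMd:o=%d:mem=%s" % (order, mem),
            archive, logfile]

cmd = ppmd_command("access_log.7z", "access_log")
print(" ".join(cmd))

if shutil.which("7z"):  # only run when 7-Zip is actually installed
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "access_log")
        with open(src, "w") as f:
            f.write('127.0.0.1 - - "GET / HTTP/1.1" 200 512\n' * 1000)
        out = os.path.join(d, "access_log.7z")
        subprocess.run(ppmd_command(out, src), check=True,
                       stdout=subprocess.DEVNULL)
        assert os.path.getsize(out) < os.path.getsize(src)
```

Note the memory setting governs the model size for both directions; whoever decompresses needs the same 1GB available, which is exactly the footprint symmetry complained about above.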
I don't yet have any good data,
partly because the fast pace of startups means the character of the
logs I work with changes and partly because I rarely have the luxury of
trying more than one method on the same data. However, once I do, I'll
post some hard-number comparisons between LZMA and PPMd with various settings.
Next year, look for my musings on compression of database redo/write-ahead logs.