Adaptive Log Compression for Massive Log Data



Log data is ubiquitous and enormous. The standard log compression method is to compress the entire log as a single file using a general-purpose tool such as bzip2 or gzip.

In practice, log entries are often heterogeneous, with varying patterns over time. They also have strong temporal locality. Thus, a better approach is to adaptively distribute entries to different buckets, and compress buckets separately in parallel.
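The bucketing idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the routing rule (grouping entries by their first whitespace-separated token) is a hypothetical stand-in for the adaptive methods studied in the papers, and compression is done serially here even though buckets could be compressed in parallel.

```python
import bz2

def bucket_key(entry):
    """Hypothetical routing rule: group entries by their first token."""
    parts = entry.split(None, 1)
    return parts[0] if parts else ""

def compress_bucketed(entries):
    """Distribute entries into buckets, then compress each bucket separately."""
    buckets = {}
    for entry in entries:
        buckets.setdefault(bucket_key(entry), []).append(entry)
    # Each bucket is an independent stream, so this loop could run in parallel.
    return {key: bz2.compress("\n".join(group).encode())
            for key, group in buckets.items()}

def compress_whole(entries):
    """Baseline: compress the entire log as one file."""
    return bz2.compress("\n".join(entries).encode())

# Toy heterogeneous log with two interleaved entry patterns.
logs = [
    "ERROR disk /dev/sda failed",
    "INFO user alice logged in",
    "ERROR disk /dev/sdb failed",
    "INFO user bob logged in",
] * 200

whole = len(compress_whole(logs))
bucketed = sum(len(blob) for blob in compress_bucketed(logs).values())
print(f"whole-file: {whole} bytes, bucketed total: {bucketed} bytes")
```

On real heterogeneous logs, grouping similar entries tends to improve the compressor's ability to exploit redundancy; on a tiny toy input like this, per-stream overhead can dominate, so the sketch only demonstrates the mechanism, not the savings.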

Several methods for determining which bucket an incoming log entry should be placed in were studied. The effectiveness of these methods was compared when compressing several real-world logs.


Publications

1. Improving Compression of Massive Log Data,

    Full Version:  

2. Adaptive Log Compression for Massive Log Data,

    abstract:   poster:  

Source Code

The software used for the papers can be found on GitHub by following this link.

Data Sets

Test data used in this project can be found here.

I became aware in July 2013 that the link may no longer be working. Since this is the link I used to download the data myself, I will leave it here until I am notified of the data's new location.


If you have any questions regarding the code or publications, feel free to email the author, Robert Christensen.