There are several file systems that compress your data on the fly. The most notable ones are Windows’ NTFS, Linux’s BTRFS, and ZFS, which is used by several operating systems. It’s a great feature: it saves disk space and in many cases improves performance.
I created a benchmark that allows fairly accurate evaluation of the various compression algorithms that are (or can be) used for this purpose. Thanks to Przemysław Skibiński, whose more generic benchmark I relied on heavily. I replicated the way compression works in ZFS almost exactly; the differences are for the sake of simplicity and shouldn’t have any measurable impact. First, some background:
Hard disks store data in ‘sectors’ of usually 512 or 4096 bytes. This is the smallest unit that can be read or written at a time.
The time needed to read or write a 512-byte sector is almost the same as for 8 such sectors, but beyond that it starts to grow. Therefore filesystems store data in blocks, usually of 4 KB. With compression, data is first split into blocks and each block is compressed separately. When a block is poorly compressible and compression can’t save a full sector, it’s stored uncompressed. Please note that with a 4 KB sector and a 4 KB block, compression can never save anything.
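The store-or-skip decision above can be sketched in a few lines. This is a minimal illustration, not ZFS’s actual code: `store_block` and the use of zlib as the compressor are my own assumptions for the example.

```python
import zlib

def store_block(block: bytes, sector_size: int = 4096) -> bytes:
    """Illustrative sketch (not ZFS's real code): keep the compressed
    form only if it occupies fewer whole sectors than the raw block."""
    compressed = zlib.compress(block)
    # Sizes are rounded up to whole sectors, since a sector is the
    # smallest unit the disk can read or write.
    raw_sectors = -(-len(block) // sector_size)
    packed_sectors = -(-len(compressed) // sector_size)
    if packed_sectors < raw_sectors:
        return compressed  # saves at least one full sector
    return block           # poorly compressible: stored as-is

# A 4 KB block on a 4 KB-sector disk can never save a sector,
# so it is always stored uncompressed:
assert store_block(bytes(4096), sector_size=4096) == bytes(4096)
# The same block with 512-byte sectors easily saves sectors:
assert store_block(bytes(4096), sector_size=512) != bytes(4096)
```

The rounding to whole sectors is exactly why sector size matters so much for compression, as the results below show.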
In this post I concentrate purely on the strength of different algorithms in different scenarios. Performance will come later.
I tested the following algorithms:
ZFS uses LZJB and zlib; BTRFS uses LZO (there are something like 20 variants, and I don’t know which one) and zlib; NTFS uses a proprietary algorithm.
Two data points:
I tried sector sizes of both 512 bytes and 4 KB and varied block size from 4 KB to 128 KB (the maximum value in ZFS).
The x-axis shows compressed size as a percentage of the original.
1. One thing that I have never seen mentioned in ZFS guides is that larger blocks improve compression by a lot; in some cases they may halve your storage needs. I guess people skip it because block size has a huge impact on performance and you should never change it unless you know what you’re doing. However, if you do know, it’s a trick worth remembering.
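The effect is easy to reproduce with a toy experiment. The sample data and block sizes here are my own choices, not the benchmark corpus; the point is only that compressing in larger independent blocks yields a smaller total.

```python
import zlib

# Hypothetical sample: moderately redundant text (not the benchmark data).
data = b"the quick brown fox jumps over the lazy dog " * 4096

def compressed_size(data: bytes, block_size: int) -> int:
    # Compress each block independently, as a filesystem would.
    return sum(len(zlib.compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))

small_blocks = compressed_size(data, 4 * 1024)    # 4 KB blocks
large_blocks = compressed_size(data, 128 * 1024)  # 128 KB blocks
# Larger blocks give the compressor more context and less per-block
# overhead, so the total is noticeably smaller.
assert large_blocks < small_blocks
```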
Since writing this post I have learned that 128 KB is not only the maximum but also the default block size in ZFS, so shrinking it can only lose you compression. The point still stands for BTRFS, though.
2. 4 KB sectors offer lower granularity, which hurts compression. In particular, an 8 KB block has to compress to half its size to yield any savings; not many blocks do, so the compression ratio is dreadful. I think that with 4 KB-sector drives, compression is not worth it for many workloads: large blocks aren’t general purpose, and you need them to get reasonable savings.
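The granularity argument is pure arithmetic, sketched below. The 60% compression figure is a made-up example, not a benchmark result.

```python
def sectors_saved(block_size: int, compressed_size: int, sector_size: int) -> int:
    """Sectors saved by storing the compressed form instead of the raw block."""
    raw = -(-block_size // sector_size)          # ceil division
    packed = -(-compressed_size // sector_size)
    return max(raw - packed, 0)

# An 8 KB block that compresses to 60% of its size (4915 bytes):
assert sectors_saved(8192, 4915, 512) == 6    # 512 B sectors: 16 -> 10 sectors
assert sectors_saved(8192, 4915, 4096) == 0   # 4 KB sectors: 2 -> 2, nothing saved
```

With 4 KB sectors, anything between 50% and 100% of the original size rounds back up to the full two sectors, which is why the block must be at least halved to save anything.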
3. LZJB sucks. It’s clearly the weakest of the contenders. Does performance save it? I’ll answer that in a later post.
4. Some guides recommend using zlib -9 where performance doesn’t matter and LZJB elsewhere. It turns out that -6 is practically just as strong.
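You can check this yourself with Python’s `zlib`, whose levels match gzip’s -1 through -9. The generated sample data here is hypothetical, not the benchmark corpus; on text-like input the two levels typically land within a few percent of each other.

```python
import random
import zlib

# Hypothetical word-like sample data (not the benchmark corpus).
random.seed(0)
words = [bytes(random.choices(range(97, 123), k=random.randint(3, 10)))
         for _ in range(200)]
data = b" ".join(random.choices(words, k=50000))

size_6 = len(zlib.compress(data, 6))  # zlib's default level
size_9 = len(zlib.compress(data, 9))  # maximum level
# The gap between -6 and -9 is small relative to the compressed size.
assert abs(size_6 - size_9) / size_6 < 0.05
```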
5. Some algorithms scale better with block size than others. In particular:
You can download rough spreadsheets with more detailed results here.