Introduction
Inspired by Yann Collet’s post I decided to play with fuzz testing.
The idea is tempting, I do nothing and the computer tests code for me. Why not? I took a look at what people use (nearly all use afl). I took a look at how people use it. And saw that the big guys use clusters with thousands of cores fuzzing single project for years.
Oops. I could dedicate 2 old cores at the task. Maybe for a couple of months, but still my effort is puny. Could I find any bugs with what I have at all?
I decided to try anyway. After all, single developers (like Yann) don’t have clusters at disposal and still find fuzzing useful.
OK, I’m in. What should I fuzz?
Some widely used library like zlib is out, too well fuzzed already to be worthwhile for me.
A research codec like crush? I like it. I’m relatively likely to find bugs in it. But who cares? It has no pretensions to be useful in production and nobody uses it for anything but fun.
I decided to concentrate on something in between. Codecs that are
- considered to be production quality
- overall working good
- not version 1 committed last week
Initially, I picked bsc. Later on I extended the test to cover also:
Did I manage to find any bugs?
Yes, I did. In each of them in fact… In all but 2, within seconds from the first run of a proper build.
What did I find?
- bsc – A lot of bugs in decompression. One in compression too.
- density – After compiling with bug-detecting flags wouldn’t work, there was a division by zero in speed calculation when the file was small and compression time was 0. I fixed it and fuzzed out a lot of bugs.
- heatshrink – just a pair of broken assertions (one in compressor, one in decompressor). And the author has fixed them before I reported them.
- lzham – After compiling with bug-detecting flags wouldn’t work at all
- zpaq – After compiling with bug-detecting flags wouldn’t compress. Fuzzing decompression yielded several bugs.
- zstd – Quite resilent. After several hours of fuzzing I found a lone bug in decompression. I fuzzed it for another week, found nothing.
I have 3 thoughts concluding that:
- 2 programs that resisted my attempts better than others. Both have been afl-fuzz tested before. Both used other automated testing too; heatshrink – property based tests and zstd – a custom fuzzer.
- Building a program with instrumentation is a large portion of success. With 3 codecs I found bugs before I started fuzzing.
- It’s much harder to write correct code than I thought. All these codecs have been written by good devs. All but zstd exist for years. Not sure about density, but all others are used in production. All were easy to break, though heatshrink only with assertions on.
Finding bugs
You may wonder how to fuzz effectively? I’m still just a noob, but I have learned some lessons already.
There are 2 components for finding a bug by fuzzing.
- Your fuzzer has to run the code on data that causes fault condition and
- it has to notice the problem
Let’s concentrate on noticing first. Afl has no knowledge of what programs state is correct and what is not. It finds just one kind of errors. Crashes. Therefore the goal is to cause the program to crash in as many erroneous conditions as possible.
The first thing to do is enable assertions. If you’re a developer and you’re not used to writing them – start to do so. Combined with terabytes of fuzzer-generated data, they are a really good way of validating the assumptions that you make when writing the code.
Then, there’s clang with its wonderful sanitizers. The ones useful are:
One could enable them all, but I recommend that in general case it’s not a great idea. The reason is that Memory Sanitizer slows the code where I tried it by a factor of 20-30. It enables you to find one more class of bugs, but greatly reduces chances of finding others. It would make more sense to start without it and after afl stops finding new test cases that improve coverage, recompile with MSan and continue fuzzing for a while.
Also, docs for Undefined Behavior Sanitizer are broken and it’s not trivial to turn it on properly.
The basic usage is to add -fsanitize=undefined to the command line. However if you the read docs of another switch, -fsanitize-recover, you’ll learn that by default majority of checks caused by -fsanitize=undefined will not cause a crash but just print a message. You have to add -fsanitize-trap for all such checks. And there’s a common problem with UBSan and compressors on AMD64. This architecture allows unaligned memory access and codecs tend to exploit it. UBSan detects them and flags as error, so you should add -fno-sanitize=alignment too.
There are further switches that improve the situation:
-fstack-protector-all guards against stack buffer overflows. I’m not sure, with ASan it might be redundant.
-DFORTIFY_SOURCE=2 with some stdlibs makes functions like memcpy more aware of what buffers they operate on, so they can detect overflows too. Again it, may be redundant.
To sum up, the combinations of switches that I arrived at is:
-DFORTIFY_SOURCE=2 -fstack-protector-all -fsanitize=undefined,address -fsanitize-trap=array-bounds,bool,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,nonnull-attribute,null,object-size,returns-nonnull-attribute,shift-base,shift-exponent,signed-integer-overflow,vla-bound -fno-sanitize=alignment
With density and zpaq, compiling like that would take several hours. At least in case of density the reason was code bloat that I managed to fix by changing inlining settings. I didn’t investigate zpaq.
I should note however that this is not bulletproof. I’m not sure if it’s a clang bug, but I noticed cases when there would be an error printed, but no crash. I don’t know what to do about it. OK, I should ask llvm people, but so far I’m busy with other stuff.
Now, many types of errors will cause a crash. But there’s one more thing that can be done and of the tested programs only lzham does (and I’m not sure if it does in a fuzzer-detectable way). Add a switch for the compressor that, after it’s done, decompresses the data and verifies that it matches the original. After all, a compressor that destroys data w/out crashes is not a good one.
Initial testset
Generation of data is another important topic. First, if your codec has a checksum verification on decompression, disable checking it. It’s pointless to fuzz with files that get discarded before reaching your algorithm.
After that you need to create an initial test set. Several rules to follow:
- Try hard not to use large testsets. It really kills performance, both because it makes each run of your exe slower and because it slows down deterministic tests of afl itself. Afl even has a limit of 1 MB, though normally it’s way to large. The recommended size is up to 1 KB. I once got several days of fuzzing accumulated that I decided to discard. Almost all of the test files were ~64 KB in size and it would take c.a. one year to finish the first round. I created a new test set with the largest file being 5 KB. Initially it was much poorer, but overtook the original in several hours.
- Start with a diverse set of files. Get as many as you can and then run afl-cmin to trim the set down to cover only interesting cases. For zstd decompression, I started with:
- several standard benchmark files cut into small pieces (2-16 KB) and compressed; then filtered to include only the ones under 1 KB,
- 1 MB of zeros compressed. zstd splits data into 512 kB blocks, I wanted to have a case with more than one,
- Every file that I got while testing compression; again filtered to keep only smaller files,
- 2 highly compressible (but not as simple as a stream of zeros) files of 129 KB, so they are larger than window size. The larger was 5 KB compressed,A block of random data, compressed as well.
In all, that was 120 000 files that afl-cmin trimmed to under 500.
3. Sort the files by size, ascending. Afl fuzzes files in order determined by file name. When interesting file is found, it gets added to its fifo. When you can find the same interesting feature with 2 files, you want to keep the smaller one. Afl doesn’t do it, it keeps the first one. So you want it to find the smaller one first and sorting is a hack to do just that. The proper way would be to use a priority queue instead of a fifo in afl itself. I suggested it to the author but got no reply.
Running afl
Afl employs a number of strategies to find common errors. For example, it injects special integers like 0xFFFF, increases or decreases words or flips bits. Not all strategies work equally well.
For testing compression, there were 2 strategies that ruled:
- havoc, which does a lot of random changes,
- splice, which takes a front from one file, the back from another, joins these together and runs havoc on the result.
With splice finding 2-4 times as many test cases as havoc.
You can instruct afl to run only these strategies by running it with a -S switch. You can’t instruct it to use only slicing. A source modification for doing so would be welcome (doing it the dirty way is easy, though I believe command line switches with strategy selection would be much better).
Different compression levels result in different code paths. Therefore it’s important to fuzz all distinct ones. What I noticed is that a file that’s interesting for one level is often interesting for another too. Therefore it’s useful to run a number of afl-fuzz instances (even more than cores available) and make them share results. Afl does a sharing itself as long as all instances use -M or -S switch (the latter I mentioned already) and use the same output directory.
For decompression, havoc and splice are again the efficiency leaders (superiority of the latter surprises me), though bit flips are remarkably effective too. I recommend running a single master (-M) instance and a number of slaves (-S). Maybe even run the master with lowered priority, so the more effective finders get more CPU.
Summary
If you’re a codec developer, do your users a favour. Fuzz your code. And do it carefully. And every time you find a bug, sing along with me:
Fuzzing's great, fuzzing's good, fuzzing does what fuzzing should. Fuzzing's just, fuzzing's right, fuzzing is our guiding light. Always right, never wrong, we praise fuzzing in this song.
Also, share your test data. A readily available high-coverage test set makes it easier for others to do QA for your project on their own.