I am an extreme moderate

April 1, 2011

Flawed testing, part 1

Filed under: Rants — Tags: — niezmierniespokojny @ 10:03 am

I see there are common errors in benchmarking stuff.
The first thing is looking too close at arbitrary data points.
Let’s look at Computer Language Benchmarks Game for example.
It benchmarks languages for 3 major characteristics:

  • compiled (interpreted) code run time
  • compiled (interpreted) code memory usage
  • code size
  • Let’s plot size vs. speed (and ignore memory for the sake of simplicity):

    size vs. speed

    I highlighted the following:

  • C++ GNU g++
  • C GNU gcc
  • Scala
  • Lua LuaJIT
  • JavaScript V8
  • Python PyPy
  • You can see that they are Pareto frontiers, which means that each may be beaten in one metric, but not in both.

    But let’s see another visualization of the same data:

    size vs. speed - hiperbole

    As you can see, Pareto frontiers form a kind of hyperbola. It’s not well visible, because we have too little data. But such shape is a usual thing. The harder you try to optimize for speed – the more simplicity you have to sacrifice for gains that keep getting smaller. The more you simplify – the slower it works. Just regular tradeoffs.

    So where’s the problem?
    Each of the languages has its own hyperbola. There’s more than 1 way of writing a program in each of them, yet they are represented just by 1 point. What’s worse, by an extreme point. Let’s speculate about hyperbolas for some of them:

    size vs. speed - hiperboles

    The drawing may be way off, I didn’t think about it a lot, let alone test it. The problem is that implementations that are the fastest for a particular language are never even close to being efficient in terms of size. The globally fastest language (C++) is Pareto optimal. But the other data points are almost useless. Really, probably none of them would be a frontier if we allowed more than 1 score / language. They show language performance limitations. But efficiency? Not at all. So what does code size tell in this test? Nothing. Null. Nada. What does memory usage tell us in a test like this? Exactly the same.
    Though in some cases it’s somewhat better – if you can use the benchmark data to estimate how would it work in your case and know ‘Even when optimized totally for parameter X, parameter Y is fair enough’.

    And the problem is common. The Computer Language Benchmarks Game is otherwise a great resource, yet there is a major flaw in their methods.
    Just like this compression comparison. Does wavpack compress faster than flac? In a quick test that I performed – no, it’s 300 times slower. But with the compression settings that I chose it’s much stronger. Does it mean it’s much stronger? No, we’d have to draw a chart for each of them and then compare. Like it’s done here.

    flac vs. wavpack

    It shows that in fast modes, flac is both faster and stronger than wavpack, but in slower ones it’s the opposite.
    It also shows that TAK and ape are very flexible – they can perform well in a wide range of speed vs. time tradeoffs.

    I’m sure one could find many examples outside of IT too. Tradeoffs are universal. Letting user choose where to make them is not uncommon. So be careful what do you do with data that you gather.

    Blog at WordPress.com.

    %d bloggers like this: