I am an extreme moderate

April 1, 2011

Flawed testing, part 1

Filed under: Rants — Tags: — niezmierniespokojny @ 10:03 am

I see there are common errors in benchmarking stuff.
The first thing is looking too close at arbitrary data points.
Let’s look at Computer Language Benchmarks Game for example.
It benchmarks languages for 3 major characteristics:

  • compiled (interpreted) code run time
  • compiled (interpreted) code memory usage
  • code size
  • Let’s plot size vs. speed (and ignore memory for the sake of simplicity):

    size vs. speed

    I highlighted the following:

  • C++ GNU g++
  • C GNU gcc
  • Scala
  • Lua LuaJIT
  • JavaScript V8
  • Python PyPy
  • You can see that they are Pareto frontiers, which means that each may be beaten in one metric, but not in both.

    But let’s see another visualization of the same data:

    size vs. speed - hiperbole

    As you can see, Pareto frontiers form a kind of hyperbola. It’s not well visible, because we have too little data. But such shape is a usual thing. The harder you try to optimize for speed – the more simplicity you have to sacrifice for gains that keep getting smaller. The more you simplify – the slower it works. Just regular tradeoffs.

    So where’s the problem?
    Each of the languages has its own hyperbola. There’s more than 1 way of writing a program in each of them, yet they are represented just by 1 point. What’s worse, by an extreme point. Let’s speculate about hyperbolas for some of them:

    size vs. speed - hiperboles

    The drawing may be way off, I didn’t think about it a lot, let alone test it. The problem is that implementations that are the fastest for a particular language are never even close to being efficient in terms of size. The globally fastest language (C++) is Pareto optimal. But the other data points are almost useless. Really, probably none of them would be a frontier if we allowed more than 1 score / language. They show language performance limitations. But efficiency? Not at all. So what does code size tell in this test? Nothing. Null. Nada. What does memory usage tell us in a test like this? Exactly the same.
    Though in some cases it’s somewhat better – if you can use the benchmark data to estimate how would it work in your case and know ‘Even when optimized totally for parameter X, parameter Y is fair enough’.

    And the problem is common. The Computer Language Benchmarks Game is otherwise a great resource, yet there is a major flaw in their methods.
    Just like this compression comparison. Does wavpack compress faster than flac? In a quick test that I performed – no, it’s 300 times slower. But with the compression settings that I chose it’s much stronger. Does it mean it’s much stronger? No, we’d have to draw a chart for each of them and then compare. Like it’s done here.

    flac vs. wavpack

    It shows that in fast modes, flac is both faster and stronger than wavpack, but in slower ones it’s the opposite.
    It also shows that TAK and ape are very flexible – they can perform well in a wide range of speed vs. time tradeoffs.

    I’m sure one could find many examples outside of IT too. Tradeoffs are universal. Letting user choose where to make them is not uncommon. So be careful what do you do with data that you gather.

    2 Comments »

    1. > So what does code size tell us in a test like this? Nothing. Null. Nada. What does memory usage tell us in a test like this? Exactly the same.

      So it doesn’t tell you anything about “the fastest programs for each programming language implementation, measured on this OS/machine.” ?

      2) Incidentally, what’s with the uneven grid ticks at 12.59 15.85 25.12 etc ?

      Also, as you are only interested in half-a-dozen points, your reasoning might be easier to follow if

      – you use a simple circle or square for the data points

      – don’t show all those language labels

      – just label directly on the chart the 6 data points you want to talk about

      Comment by Isaac Gouy — April 15, 2011 @ 5:56 pm

      • > So what does code size tell us in a test like this? Nothing. Null. Nada. What does memory usage tell us in a test like this? Exactly the same.
        So it doesn’t tell you anything about “the fastest programs for each programming language implementation, measured on this OS/machine.” ?

        OK, but it’s useless because if you care about anything but speed, you don’t optimize purely for speed.
        It could be useful if it let you predict code size or memory usage of your application and know “A program optimized purely for speed fares well enough, so we don’t have to bother”. But it doesn’t.
        I’m thinking how is it when we move from the languages to a more general case….and I think it can be different. That is – sometimes you may be able to predict how is it in your case and know it’s OK. So I should fix the post somehow.
        Thank you.

        information.
        2) Incidentally, what’s with the uneven grid ticks at 12.59 15.85 25.12 etc ?

        Logarithmic scale. The points were very cramped with a regular one. And the hiperbola was very steep on it, it looks better like this. 😉

        Also, as you are only interested in half-a-dozen points, your reasoning might be easier to follow if
        – you use a simple circle or square for the data points
        – don’t show all those language labels
        – just label directly on the chart the 6 data points you want to talk about

        Thank you for the feedback, I spent a while thinking about it and decided to leave it as it is. I think it highlights the point that the supposed Pareto frontiers might not be at the front at all.

        Comment by niezmierniespokojny — April 17, 2011 @ 3:21 pm


    RSS feed for comments on this post. TrackBack URI

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out / Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out / Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out / Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out / Change )

    Connecting to %s

    Blog at WordPress.com.

    %d bloggers like this: