Everyone who has written a CUDA app should read this

This was a tutorial given at ISC this year. It describes why most methodologies for comparing speedups are flawed and proposes an alternative.

http://hal.archives-ouvertes.fr/docs/00/45…estDocument.pdf

I know that it is 38 pages, but please read the whole thing.

The slides can be found here:

http://hal.archives-ouvertes.fr/inria-00443839

Edit: This does not address the problem of having an unoptimized baseline, but it is still very important.

I think that’s the point: I generally don’t trust reported speedup values anyway, so no amount of fancy statistics will help here.

Just the theoretical ratios should be a good indicator, no?

For example:
Intel Core2Duo: 14 GFLOPS (SP or DP? Let’s assume SP)
Tesla: 933 GFLOPS (it includes the interpolation stuff as well, I think); without that it should be around 700 GFLOPS

The theoretical ratio stands at around 50x (700 / 14), which is a good indicator.
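To make that arithmetic explicit, here is a minimal Python sketch. It only divides the peak figures quoted above; the variable and function names are my own, and the 700 GFLOPS value is the post's rough estimate rather than a vendor number.

```python
# Peak figures quoted in the post above (theoretical peaks, not measurements).
cpu_peak_gflops = 14.0          # Intel Core2Duo, assumed single precision
gpu_peak_gflops = 933.0         # Tesla, headline figure
gpu_peak_usable_gflops = 700.0  # the post's estimate without the interpolation units


def peak_ratio(gpu_gflops, cpu_gflops):
    """Return the ratio of two peak throughput figures."""
    return gpu_gflops / cpu_gflops


headline_ratio = peak_ratio(gpu_peak_gflops, cpu_peak_gflops)
usable_ratio = peak_ratio(gpu_peak_usable_gflops, cpu_peak_gflops)
print(f"headline: {headline_ratio:.1f}x, usable: {usable_ratio:.1f}x")
```

A claimed speedup far above either ratio says more about the CPU baseline than about the GPU.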

People do claim 200x, 2000x and even 20000x… (including me…) It all boils down to how good the CPU implementation is…

Anyway, I am going to read the PDF now… Let’s see what they say…
Thanks for bringing this up Greg!

In hindsight, this paper will be useful for the general HPC world, but when the speedups are so obvious (even 40x), I don’t think the deviating factors play a role…
But I really liked the DVS thing (room temperature)… That’s very, very interesting… I think someone in this forum had an under-voltage issue (not related to room temperature though) that was causing rare crashes…

In any case, the document is a good read and some of the tips are very useful! Thanks!

Their methodology would be most useful when comparing results on the same hardware, i.e. the same GPU. The paper boils down to these key points:

  1. Avoid measurement bias.

  2. Compare apples to apples.

  3. Don’t throw away outliers.

  4. Determine if a result is significant before reporting it.

Here is at least what I took away from each point in their paper.

  1. Reported statistics (such as mean or minimum execution times) are not accurate reflections of what an arbitrary user could expect to see if the measurements behind them are biased. We should actively take steps to reduce that bias. For example, rather than looping over multiple instances of the same kernel and recording the average time, launch the application multiple times, run other applications in between, and reboot the system occasionally.

  2. Run all experiments on identically configured systems, using the same inputs to the application, compiled with the same optimization level, using the same version of the driver, the compiler, etc.

  3. Don’t ever throw away a result unless it was due to an error on your part. True outliers should be rare from a statistical perspective, meaning that they will be washed out if you include enough samples. If you see one, and it significantly affects your results, it is likely that someone using your program would also see it.

  4. The whole point of this is to recognize that an observed speedup of 5x is just a single observation. Non-determinism or measurement inaccuracy in the system could cause the observed speedup to fluctuate between runs. Actually perform the analysis to determine whether someone else who runs your application is likely to achieve the same result. They even give you a script in there that will do the fancy statistics for you.
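As a rough illustration of the first point, here is a minimal Python sketch that times fresh launches of a whole program instead of looping inside one process. The helper name `time_fresh_launches` and the stand-in workload are my own assumptions, not anything from the paper, and the sketch does not cover running other applications in between or rebooting.

```python
import statistics
import subprocess
import sys
import time


def time_fresh_launches(command, launches=10):
    """Run `command` in a new process `launches` times.

    Returns the wall-clock time of each launch in seconds. Every sample
    includes process start-up cost, which is what a user launching the
    application would also pay.
    """
    samples = []
    for _ in range(launches):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return samples


# Stand-in workload so the sketch runs anywhere; replace it with the
# command line of the application you actually want to measure.
demo_command = [sys.executable, "-c", "sum(range(100000))"]
samples = time_fresh_launches(demo_command, launches=5)

# Keep every sample, outliers included, and report more than one statistic.
print(
    f"n={len(samples)} median={statistics.median(samples):.4f}s "
    f"min={min(samples):.4f}s max={max(samples):.4f}s"
)
```

Keeping the raw `samples` list around, rather than only an average, is what makes the later significance analysis possible.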

As for comparisons between completely different architectures (CPUs vs GPUs) and different applications written by different people, just don’t do it. Report how well your GPU application did compared to the theoretical max of the GPU you ran it on. If you want to show how much better it is than something on a CPU, report how well the CPU implementation did compared to the theoretical max of the CPU that was used. And use this methodology to obtain both measurements.
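A minimal Python sketch of that kind of reporting follows. The peak figures are the ones quoted earlier in this thread; the achieved figures are hypothetical placeholders that I made up for illustration, not measurements of any real application.

```python
def fraction_of_peak(achieved_gflops, peak_gflops):
    """Return achieved throughput as a fraction of the theoretical peak."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return achieved_gflops / peak_gflops


# Peaks: figures quoted earlier in the thread.
# Achieved values: HYPOTHETICAL placeholders, not real measurements.
gpu_fraction = fraction_of_peak(achieved_gflops=350.0, peak_gflops=700.0)
cpu_fraction = fraction_of_peak(achieved_gflops=3.5, peak_gflops=14.0)

print(f"GPU: {gpu_fraction:.0%} of peak, CPU: {cpu_fraction:.0%} of peak")
```

Reporting both fractions side by side shows how well tuned each implementation is, which a raw GPU-over-CPU ratio hides.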

I would agree that the overall hypothesis that an approach is effective will probably not be affected if the speedup is 40x, but 40x might not be the right number to report. You still want to record a large number of samples, determine the median, keep the outliers, etc., and then report a final value that includes significance. For example, rather than saying 40x, you would say that the 95% confidence interval is between 38x and 41.5x. Something like [20x, 40.1x] would be far less impressive, but you still might end up reporting 40x if you only took one sample, or took the average.
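One way to get such an interval is a percentile bootstrap over the recorded run times. The Python sketch below is my own illustration of that general technique, not the script that ships with the paper, and the timings in it are synthetic numbers invented for the example.

```python
import random
import statistics


def bootstrap_median_speedup_ci(baseline_times, new_times,
                                resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median speedup.

    The speedup is median(baseline) / median(new). Both sample sets are
    resampled with replacement, outliers included.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    ratios = []
    for _ in range(resamples):
        base = rng.choices(baseline_times, k=len(baseline_times))
        new = rng.choices(new_times, k=len(new_times))
        ratios.append(statistics.median(base) / statistics.median(new))
    ratios.sort()
    tail = (1.0 - confidence) / 2.0
    low = ratios[int(tail * resamples)]
    high = ratios[min(resamples - 1, int((1.0 - tail) * resamples))]
    return low, high


# SYNTHETIC run times in seconds, invented for illustration only.
baseline_times = [4.10, 4.02, 3.95, 4.30, 4.05, 4.00, 5.20, 4.08]
new_times = [0.101, 0.099, 0.100, 0.130, 0.098, 0.102, 0.100, 0.105]

point = statistics.median(baseline_times) / statistics.median(new_times)
low, high = bootstrap_median_speedup_ci(baseline_times, new_times)
print(f"median speedup {point:.1f}x, 95% CI [{low:.1f}x, {high:.1f}x]")
```

With only a handful of samples the interval will be wide, which is exactly the information a single "40x" hides.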

Perhaps this is new, or at least not intimately familiar to computer science or electrical engineering types. As a physicist, these are some of the core principles of doing an experiment. My read on the paper is that someone is finally advocating application of good experimental technique to reporting results in CS (as opposed to marketing hype, which kind of tries to avoid all of the above principles and simply promote the product that is paying for the work to be done.)

Regards,

Martin

I don’t think it is really “new” to computer science, but many people don’t appreciate its importance. (I know that physics students I’ve taught, and even some physicists, struggle with it from time to time.) Basically, this paper is pointing out that run time on a modern computer system is a stochastic process, and so the definition of a measurement has to change.

Convincing people of that is the real challenge. The methodology follows naturally.

I’m a bit more cynical about all this. Sure, it’s great to know what the “real” speedup of something is, but it’s not worth the time investment it requires to properly measure it. It’s better, instead, to treat the speedup measurements as indicators of possible performance and use them to guide your optimization efforts. Furthermore, if you force yourself to only consider the end user, then you have to consider the performance impact in the application the user is actually using. With this approach, you overlook the possible impact that accelerating components (e.g. matrix operations) can have in other application areas.

By analogy, user studies are all the rage in the VR research community as a way of demonstrating that e.g. user interaction techniques are actually effective. However, almost no one is willing to invest the time or effort to work with a psychologist and a mathematician to create a proper user study and derive proper statistics from it. For one, there simply isn’t enough space in 8 pages of conference paper to present a new technique and describe a thorough user study used to test it. For another, it’s not worth the added time investment: a minimal user study is sufficient to get a paper accepted and present a rough idea of how useful a technique is.

Going slightly off topic here: I would really like to publish papers like “10 different ways NOT to do xxxxx”, followed by the “right ways” to do it…
One should also know about the failures that went into a piece of work… It may not help experts… but beginners would surely like it.

We really need a system that concentrates on failures; it would help a lot in the learning process… What do you guys think?

It depends on what your goal is. If your goal is to get a paper published in pretty much any academic conference, then no, it is probably not worth the investment. As you say:

There are equivalents in CS and EE circuits/architecture/compilers/etc conferences.

If your goal is to determine whether or not some hardware optimization, algorithm, or new application is actually better than the state of the art, that may require a slightly different approach. Why do most of the work designing it if you don’t care enough to actually evaluate it thoroughly?

Even if you cut most of the experimental methodology out of the paper due to space constraints, it still improves the quality of your work to actually do it because it gives you a chance to filter out techniques that underperform before you ever try to publish them. Just cite someone else who described their methodology and say that you did the same thing. In this case, how does presenting medians with confidence intervals take up any more space than just the average?

Usually the way I read papers is to actually read the descriptions of the new technique they are evaluating, try to read the experiments, realize that there isn’t enough information to tell whether or not the technique is useful, stop reading and judge the work based on how reasonable the description of the new technique sounded. If I actually want to use the technique, I have to implement it myself and evaluate it myself.

I would argue that doing more evaluation up front before you publish something makes it more likely that other people will actually use your new idea.

I think that it would be valuable, but probably not the type of thing that most conferences are looking for. A conference would still be interested in things that don’t work, but more of things an expert would consider a good idea that ended up not working for some reason that wasn’t obvious.

That being said, no one is stopping you from writing it and posting it somewhere for beginners to find.

I think you really underestimate the amount of effort proper assessments take. In VR research, a proper user study would require a full journal-length paper to present and analyze. That is, a proper analysis of a technique is a stand-alone piece of research that takes at least as much time and effort as the development of the original technique. Cramming both together into one paper is an injustice and cannot properly treat either. Better a summary analysis and a detailed description of the technique than a poor description with a so-so analysis.

It of course varies from field to field. This tutorial deals specifically with high performance computing, where the overheads of a more rigorous analysis are not substantial, and still it is commonly not done, though this is more out of an expectation that measurements are deterministic (which was probably true in the past).

In VR, is it common for researchers who develop new techniques to also evaluate them in subsequent work? Or is the task left to others? I am curious how it is possible to evaluate the effectiveness of a new technique without an in-depth analysis.

In other fields, such as physics, it is my understanding that it is expected that the evaluation of a new technique is a substantial, standalone effort. This is reflected in the frequency of new publications in that field compared to, for example, computer architecture. It is also reflected in the focus on journal papers rather than conference papers.

One factor is where you think the science is. Is the science in the development of something new that might have a speedup of anywhere from 2x to 10x, or in the development of something new that has a speedup of exactly 5x? Another factor is how you look at the benchmarking process itself. Should you benchmark against a fully optimized CPU implementation even if one doesn’t exist? To get the true and exact speedup of a GPU implementation, this is necessary. However, to demonstrate that a GPU-accelerated implementation is faster than something that is currently in common use, it’s not required.

Back on topic, I’ve just read the paper and found it to be quite an easy and interesting read. Highly recommended.