Possible bug in SDK project/template timing: debug version runs faster than release

The timing for the SDK 1.1 project ‘template’ seems to be badly broken, with the debug version running more than 250 X FASTER than the release version.

I have run this on both Linux with 1.1, and the Mac OS X Beta with similar results.

I haven’t found this reported already in these forums; I apologise in advance if this is a well-worn topic. If someone has an explanation, PLEASE point me at it or explain; it’s driving me nuts.

Example run

NVIDIA_CUDA_SDK$ for i in 1 2 3 4 5 6 7 8 9 0

> do

> echo -n 'release: '; ./bin/linux/release/template -noprompt

> echo -n 'debug: '; ./bin/linux/debug/template -noprompt

> done

release: Processing time: 34.004002 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.013000 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.319000 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.131001 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.127998 (ms)

Test PASSED

debug: Processing time: 0.158000 (ms)

Test PASSED

release: Processing time: 34.119999 (ms)

Test PASSED

debug: Processing time: 0.129000 (ms)

Test PASSED

release: Processing time: 34.422001 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.138000 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.001999 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

release: Processing time: 34.471001 (ms)

Test PASSED

debug: Processing time: 0.130000 (ms)

Test PASSED

As you can see, the release version runs in 34 ms, and debug in 0.13 ms, so debug is more than 250 X faster than release.

This was run on:

SDK 1.1 compiled with Tools 1.1

Athlon 64 X2 4600+, 4GB RAM, Asus M2N32-SLI Deluxe

GeForce 8800 GTS, 320MB memory, VBIOS 60.80.0d.00.61 - running as the only display

OpenSuse 10.2, with standard, out-of-the-box gcc (GCC) 4.1.2 20061115

NVIDIA Driver Version: 169.09

I have also run release and debug versions on a MacBook Pro, 2.2GHz Core 2 Duo

with a GeForce 8600M GT and 128MB VRAM

OS X 10.5.2, with the Leopard graphics updates, gcc 4.0.1

While this MacBook Pro configuration isn’t recommended, there is enough VRAM to run the debug version and pass the tests, so I assume it is working correctly. It gives similar ‘wrong’ results, though slower of course. The details are in my posting in the Mac OS X forum.

I have tried other projects, e.g. matrixMul, which work as I would expect:

NVIDIA_CUDA_SDK$ ./bin/linux/release/matrixMul -noprompt

Processing time: 0.158000 (ms)

Test PASSED

NVIDIA_CUDA_SDK$ ./bin/linux/debug/matrixMul -noprompt

Processing time: 0.179000 (ms)

Test PASSED

Another project which seems to give counterintuitive results is simpleTemplates, where release is much slower than debug, but only for the float test:

NVIDIA_CUDA_SDK$ ./bin/linux/release/simpleTemplates -noprompt

Processing time: 34.540001 (ms)

Test PASSED

Processing time: 0.080000 (ms)

Test PASSED

NVIDIA_CUDA_SDK$ ./bin/linux/debug/simpleTemplates -noprompt

Processing time: 0.131000 (ms)

Test PASSED

Processing time: 0.100000 (ms)

Test PASSED

I have added direct calls to gettimeofday() on Linux and to the OS X timing counters (mach_absolute_time()), and they confirm that the results are accurate: release is much slower than debug.

So, HELP! What is the explanation?

This particular manifestation of the issue has not been discussed to death on the forums, but the root cause has: improper timing.

The template project (and I assume the simpleTemplates one too) starts timing directly after the call to CUT_DEVICE_INIT. CUT_DEVICE_INIT performs a cudaThreadSynchronize(), but only in debug mode. Thus, in release mode the recorded time includes the CUDA driver initialization time, while in debug mode it does not.

As can be found in lots of posts on the forums: proper timing for benchmarking should always be done like this:

initialize

call kernel once

cudaThreadSynchronize()

record wall clock start

call kernel 100's of times

cudaThreadSynchronize()

record wall clock end

Anyway, with particular regard to the simple template projects: I wouldn’t consider improper timing a bug. The template projects are there only to provide an example of how to set up a project or a bit of code, not to demonstrate proper benchmarking in CUDA.

Thank you, that is very helpful. I remember creating new versions of those macros ages ago (0.8), now I remember why :no:

I believe I have some grasp of how to benchmark, and running release and debug versions of template could hardly be classed as a benchmark :)

The “record wall clock start, do something 100’s of times, record wall clock end” method is not an approach that should be recommended for benchmarking.

I recommend anyone interested in benchmarking look at Zed Shaw’s comments on how to measure things. I realise he has a writing style which may be hard to bear, and I really wish I had an easier-to-read reference, but he makes some good points. Skip over his rant to the “Power-of-Ten Syndrome” section on down.

It is typically much better to measure a set of runs, even across the same total number of iterations, to see how consistent the results are and to detect bias. This is especially important if you can only use wall-clock time. Further, it’s practical to determine a ‘good number’ of iterations based on the amount of variability that is seen. I tend to use the on-chip clock() function, return the value from each block, and then analyse the runs in a spreadsheet.

You are of course entitled to your opinion. I disagree with that opinion in this specific case. Let me explain why.

You go on to say:

I strongly agree with that view. I believe the template is offered by NVIDIA as a good starting point; it has no other purpose. It is there to help developers get started, and IMHO should be a model of how to write reasonable code. It doesn’t have to be brilliant code, just good, solid, reliable, predictable code. I believe examples should aim to show good practice. That’s why I consider this a bug: the template fails to demonstrate good practice, and it neither explains the poor practice it contains nor shows what good practice would be.

It seems inconsistent to suggest that misleading or confounding behaviour without an explanation is not a bug. I think that is one of the working definitions of ‘bug’.

I would be willing to accept any of the following as an improvement of template:

  1. It uses cudaThreadSynchronize() explicitly to ensure release and debug are broadly comparable and to ensure developers see that this is important (with a comment),

  2. It prints a warning to say the timings are not comparable between release and debug, along with a brief explanation or reference to an explanation,

  3. CUT_DEVICE_INIT() always calls cudaThreadSynchronize(). This is probably contentious, and may break existing code, so I can understand why this might not be an acceptable fix.

Thank you again for the cudaThreadSynchronize() pointer, it is appreciated.

GB

Well, the mantra to run a kernel hundreds of times is just the simple version I put on the forums. The main points I try to get across with that are that you can’t just measure a single kernel run, and the proper usage of cudaThreadSynchronize(). And the main reason for 100’s is not statistics: it is the precision of the clock on the computer.

For real benchmarks of individual kernels, I run them at least 100 times or for up to 5 seconds, whichever is longer. The kernels I target normally run in ~5 milliseconds. Depending on the run time of the kernel, the numbers should be adjusted accordingly.

Zed Shaw has some interesting points concerning the performance measurements of a particular piece of code over a variety of datasets. I especially like his discussion of “averages only”, where an operation taking ~30 units of time on average often spikes to ~80, which is obviously a lot slower.

But none of these apply to what I suggested as a benchmarking strategy. I guess I didn’t explicitly state it, but when the kernel is called 100’s of times it performs exactly the same amount of work each time. The sole purpose is to build up enough running time to precisely measure the average time per call. Since every call is identical, the running time is identical each time (assuming no other load on the system).

Hence there is no distribution of times to measure; we are only sampling random measurements of the same value, so the “power-of-ten syndrome” does not apply. The precision of the measurement is controlled by the law of large numbers and scales as 1/sqrt(N_measurements). Doing 10 runs of 100 calls will be no different from 1000 calls in this case.

But, as I stated: in real-world tests with varying datasets, building a distribution of performance measurements is certainly beneficial. Just keep in mind that each point in the distribution should be the same kernel call with the same data, averaged over many seconds; otherwise your distribution will show a large variation in timings simply due to the lack of precision in the computer’s wall-clock timers.

As I wrote, this is not about benchmarking.

It is about a piece of example code, offered as a good starting point for developers by NVIDIA. I think it exhibits counterintuitive and hidden behaviour. The varying number of cudaThreadSynchronize() calls (depending on whether it is compiled for release or debug) reinforces that view.

Where would one get the more accurate or complete version? Would you provide a reference, please?

I strongly encourage everyone interested in measurement and benchmarking to read Zed Shaw, skipping his rants. He tries to give some helpful insight into measurement in order to understand system behaviour.

Of course, a key to measurement is to understand what it is you’re trying to get out of the process.

My goal was to demonstrate that the sample code, offered in the 1.1 SDK as an initial project basis for developers, exhibited very different and counterintuitive behaviour when compiled with different flags (debug vs release). A set of interleaved, sequential runs showed more than two orders of magnitude difference, with highly consistent values (less than 10% variation). Job done.

GB

I saw no such statement. In fact, you responded to my first post with quite a long post on benchmarking insinuating that I know nothing about it, so I felt the need to defend my point in a 2nd post.

I’m sorry for ever mentioning benchmarking in the first place, but from your original post it seemed that you had the same misconception the average CUDA newbie has about timing kernels with cudaThreadSynchronize and driver initialization time. So I thought to correct that by mentioning benchmarking. As it turns out, my judgment was incorrect. There is no need to mention benchmarking again.

Agreed. You could also become a registered developer (http://nvdeveloper.nvidia.com/content/CDUDeveloperApplication/frmDeveloperRegistration.asp) and post a bug to the bug tracker to get these issues resolved.

I apologise for creating that discomfort. I will leave benchmarking out too.

Thank you. That’s a helpful suggestion. I’ll pursue that path.