possible bug in SDK project/template timing: Debug version runs faster than release

GarryB · April 9, 2008, 4:00pm

Thank you, that is very helpful. I remember creating new versions of those macros ages ago (0.8); now I remember why.

As can be found in lots of posts on the forums: proper timing for benchmarking should always be done like this:
initialize

call kernel once

cudaThreadSynchronize()

record wall clock start

call kernel 100's of times

cudaThreadSynchronize()

record wall clock end
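
The steps above can be sketched in CUDA roughly as follows. This is only a minimal sketch under stated assumptions, not the SDK template's code: the scale kernel and the time_kernel_ms helper are hypothetical names made up for illustration, and cudaDeviceSynchronize() is used because it is the current name for the now-deprecated cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns average wall-clock milliseconds per launch,
// or -1.0 if a CUDA call fails (for example, when no device is present).
double time_kernel_ms(int n, int iterations) {
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0;
    cudaMemset(d_data, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Call the kernel once so one-time initialisation is not timed.
    scale<<<blocks, threads>>>(d_data, n, 1.0f);
    // Launches are asynchronous: wait for the warm-up before starting the clock.
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        scale<<<blocks, threads>>>(d_data, n, 1.0f);
    // Wait for every queued launch to finish before stopping the clock.
    cudaError_t err = cudaDeviceSynchronize();
    auto end = std::chrono::steady_clock::now();

    cudaFree(d_data);
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}
```

Calling time_kernel_ms(1 << 20, 100) would return the average time per launch in milliseconds. Without the second synchronize, the measured interval would cover little more than the cost of queueing the launches.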

I believe I have some grasp of how to benchmark, and running release and debug versions of template could hardly be classed as a benchmark :)

The “record wall clock start, do something 100’s of times, record wall clock end” method is not a good approach, and it should not be recommended for benchmarking.

I recommend that anyone interested in benchmarking look at Zed Shaw's comments on how to measure things. I realise his writing style may be hard to bear, and I really wish I had an easier-to-read reference, but he makes some good points. Skip over his rant and read from the “Power-of-Ten Syndrome” section on down.

It is typically much better to measure a set of runs, even across the same total number of iterations, to see how consistent the results are, and detect bias. This is especially important if you can only use wall clock time. Further, it’s practical to determine a ‘good number’ of iterations based on the amount of variability that is seen. I tend to use the on-board time function clock(), and return the value from each block, then analyse runs in a spreadsheet.
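
One way to collect per-block device timings and look at run-to-run variability might look like the sketch below. It rests on several assumptions: timed_work, summarize and measure_runs are hypothetical names, clock64() (the 64-bit counterpart of clock()) is used to avoid counter wrap-around, and the statistics are computed on the host rather than in a spreadsheet.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical workload; thread 0 of each block records that block's elapsed ticks.
__global__ void timed_work(float *data, int n, long long *ticks) {
    long long start = clock64();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
    __syncthreads();  // every thread in the block is done before the end stamp
    if (threadIdx.x == 0) ticks[blockIdx.x] = clock64() - start;
}

struct Stats { double mean, stddev, min, max; };

// Mean, sample standard deviation, minimum and maximum of a set of measurements.
Stats summarize(const std::vector<double> &v) {
    Stats s{0.0, 0.0, 0.0, 0.0};
    if (v.empty()) return s;
    s.min = s.max = v[0];
    for (double x : v) {
        s.mean += x;
        if (x < s.min) s.min = x;
        if (x > s.max) s.max = x;
    }
    s.mean /= v.size();
    for (double x : v) s.stddev += (x - s.mean) * (x - s.mean);
    s.stddev = v.size() > 1 ? std::sqrt(s.stddev / (v.size() - 1)) : 0.0;
    return s;
}

// Returns one value per run: the mean ticks per block for that launch.
// The result is empty if any CUDA call fails.
std::vector<double> measure_runs(int n, int runs) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::vector<double> per_run;
    float *d_data = nullptr;
    long long *d_ticks = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return per_run;
    if (cudaMalloc(&d_ticks, blocks * sizeof(long long)) != cudaSuccess) {
        cudaFree(d_data);
        return per_run;
    }
    cudaMemset(d_data, 0, n * sizeof(float));
    std::vector<long long> h_ticks(blocks);
    for (int r = 0; r < runs; ++r) {
        timed_work<<<blocks, threads>>>(d_data, n, d_ticks);
        // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
        if (cudaMemcpy(h_ticks.data(), d_ticks, blocks * sizeof(long long),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            per_run.clear();
            break;
        }
        double sum = 0.0;
        for (long long t : h_ticks) sum += static_cast<double>(t);
        per_run.push_back(sum / blocks);
    }
    cudaFree(d_data);
    cudaFree(d_ticks);
    return per_run;
}
```

Passing the result of measure_runs(1 << 20, 30) to summarize gives a quick view of the spread. A standard deviation that is large relative to the mean, or a first run that sits far from the rest, suggests that more runs or a discarded warm-up are needed before the numbers can be trusted.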

You are of course entitled to your opinion. I disagree with that opinion in this specific case. Let me explain why.

You go on to say:

I strongly agree with that view. I believe the template is offered by NVIDIA as a good starting point, it has no other purpose. It is there to help developers get started, and IMHO should be a model of how to write reasonable code. It doesn’t have to be brilliant code, just good, solid, reliable, predictable code. I believe examples should aim to show good practice. That’s why it contains a bug; it fails to demonstrate good practice, or explain why it contains poor practice and show what good practice is.

It seems inconsistent to suggest that misleading or confounding behaviour without an explanation is not a bug. I think that is one of the working definitions of ‘bug’.

I would be willing to accept any of the following as an improvement of template:

It uses cudaThreadSynchronize() explicitly to ensure release and debug are broadly comparable and to ensure developers see that this is important (with a comment),
It prints a warning to say the timings are not comparable between release and debug, along with a brief explanation or reference to an explanation,
CUT_DEVICE_INIT() always calls cudaThreadSynchronize(). This is probably contentious, and may break existing code, so I can understand why this is not an acceptable fix.

Thank you again for the cudaThreadSynchronize() pointer, it is appreciated.

GB
