Possible bug in SDK project/template timing: Debug version runs faster than release

Thank you, that is very helpful. I remember creating new versions of those macros ages ago (0.8); now I remember why.

I believe I have some grasp of how to benchmark, and running release and debug versions of template could hardly be classed as a benchmark :)

The “record wall clock start, do something 100’s of times, record wall clock end” method is not a good approach, and should not be recommended for benchmarking.

I recommend that anyone interested in benchmarking look at Zed Shaw’s comments on how to measure things. I realise he has a writing style which may be hard to bear, and I really wish I had an easier-to-read reference, but he makes some good points. Skip over his rant to the “Power-of-Ten Syndrome” section on down.

It is typically much better to measure a set of runs, even across the same total number of iterations, to see how consistent the results are, and detect bias. This is especially important if you can only use wall clock time. Further, it’s practical to determine a ‘good number’ of iterations based on the amount of variability that is seen. I tend to use the on-board time function clock(), and return the value from each block, then analyse runs in a spreadsheet.

You are of course entitled to your opinion. I disagree with that opinion in this specific case. Let me explain why.

You go on to say:

I strongly agree with that view. I believe the template is offered by NVIDIA as a good starting point; it has no other purpose. It is there to help developers get started, and IMHO should be a model of how to write reasonable code. It doesn’t have to be brilliant code, just good, solid, reliable, predictable code. I believe examples should aim to show good practice. That’s why I say it contains a bug: it fails to demonstrate good practice, and it neither explains why it contains poor practice nor shows what good practice is.

It seems inconsistent to suggest that misleading or confounding behaviour without an explanation is not a bug. I think that is one of the working definitions of ‘bug’.

I would be willing to accept any of the following as an improvement to the template:

  1. It uses cudaThreadSynchronize() explicitly to ensure release and debug are broadly comparable and to ensure developers see that this is important (with a comment),

  2. It prints a warning to say the timings are not comparable between release and debug, along with a brief explanation or reference to an explanation,

  3. CUT_DEVICE_INIT() always calls cudaThreadSynchronize(). This is probably contentious, and may break existing code, so I can understand why this may not be an acceptable fix.

Thank you again for the cudaThreadSynchronize() pointer, it is appreciated.

GB