Different results when using GPU debug option -G

I had a problem running Thrust while the -G flag was set, even though some features would work just fine.

When I build in the debug configuration but without the -G flag, the GPU code does run faster, but on closer examination I found that in this mode the non-Thrust reduction/scan kernels produce different answers.

I did look at the CUDA Compiler Driver nvcc documentation:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc

but that did not explain this issue.

It is almost as if the __syncthreads() calls within the kernels are being ignored without the -G flag.

How specifically does this flag affect the code generation and execution?

This statement suggests it is a bit more complicated…

–device-debug (-G)
Generate debug information for device code, plus also specify the optimization
level for the device code in order to control its 'debuggability'.
Allowed values for this option: 0,1,2,3.

Have you had a chance to check the “non-Thrust code” that produces the different results in debug and release builds for out-of-bounds accesses and race conditions with cuda-memcheck? Do any of the reductions in question involve floating-point arithmetic?

As far as I am aware, the compiler needs to turn off pretty much all optimizations for debug builds, so the resulting machine code can be very different from default (= fully optimized) builds.

I also assumed that I had made a mistake, so I took Jimmy Pettersson’s min-max reduction code (slightly modified for my system, but the few changes only concerned the timing functions).

Since his code is usually correct, I ran it multiple times with the -G flag and without. Without the flag the reduction answers were usually wrong, while with the flag the results were correct.

Here is a Pastebin link to the exact code, which only works correctly with -G:

http://pastebin.com/sSmEn0uq

My own reduction code was simpler and was likewise incorrect without the -G flag. Maybe there is something I am missing, but I checked for memory errors and used NVIDIA Nsight, which indicated no obvious problems.

I can post more examples, but I used this one because I did not write the reduction code myself.

When I use only atomics and no shared memory, the -G flag makes no difference.
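For contrast, the atomics-only pattern mentioned here can be sketched as below (names are illustrative, and an int reduction is assumed because atomicMin() has no native float overload). Every thread folds its element straight into a single global result, so there is no shared-memory ordering for the compiler to disturb:

```cuda
#include <climits>

// Atomics-only min reduction: no shared memory and no __syncthreads(),
// so -G vs. release makes no difference to correctness.
__global__ void atomic_min_kernel(const int *in, int n, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMin(result, in[i]);   // hardware-serialized read-modify-write
}
// Host side: initialize *result to INT_MAX before the launch.
```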

It is only when using shared memory with __syncthreads() that these -G problems arise.

I can paste in the NVIDIA samples’ reduction code verbatim and see the incorrect results without the -G flag.

Also, the NVIDIA SDK sample projects are configured with the -G flag on, so when they run with this default configuration the answers are correct.

Sorry, that’s too much code for me to look at in detail. Does declaring the shared memory locations used in the reductions as volatile make any difference?

From what I have seen, problems with reductions through shared memory arise because programmers tend to implicitly rely on load-ordering behavior that is not guaranteed. In that regard, the warp_reduce_min() and warp_reduce_max() functions look suspicious to me. I see nothing that prevents the compiler from re-arranging the code as follows (e.g. to schedule loads early to cover latency in the optimized build):

tempreg1 = smem[threadIdx.x+32];
tempreg2 = smem[threadIdx.x+16];
tempreg3 = smem[threadIdx.x+8];
tempreg4 = smem[threadIdx.x+4];
tempreg5 = smem[threadIdx.x+2];
tempreg6 = smem[threadIdx.x+1];
tempreg7 = smem[threadIdx.x];
smem[threadIdx.x] = tempreg1 < tempreg7 ? tempreg1 : tempreg7;
smem[threadIdx.x] = tempreg2 < tempreg7 ? tempreg2 : tempreg7;
[…]

Clearly, such load re-ordering would destroy the desired behavior of the code. Since none of the smem[threadIdx.x+n] locations are written to, and they are not aliased to each other (common base pointer with mutually exclusive offsets), loads from these locations are all independent of each other as far as the compiler is concerned, and therefore can be re-ordered. As a related experiment, what happens when you insert __syncthreads() between consecutive lines in these two functions? This is one way of enforcing the desired order, as the compiler will not move a shared memory load across the barrier.
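The experiment suggested above can be sketched as follows; this is a conservative, hedged illustration rather than the original Pastebin code, written as a full-block loop so that every thread in the block reaches each __syncthreads(), as CUDA requires:

```cuda
// Conservative block-wide min reduction: a barrier after every step makes each
// shared-memory store visible before the next load, so the compiler cannot
// hoist loads across steps. Assumes blockDim.x is a power of two and smem[]
// holds blockDim.x elements.
__device__ void block_reduce_min(int *smem)
{
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset) {
            int other = smem[threadIdx.x + offset];
            if (other < smem[threadIdx.x])
                smem[threadIdx.x] = other;
        }
        __syncthreads();   // reached by all threads, including inactive ones
    }
    // smem[0] now holds the block minimum
}
```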

Bingo. That was it (inserting the __syncthreads() statements in the device warp_reduce functions).

What happened is that I used that exact min-max sample code as a template for my other reductions, assuming it was right because I had not written it and because it was running correctly in -G mode.

I had actually wondered about that ordering issue, but thought that the __syncthreads() statements were not required in the last warp.

I was really confused by that. Thank you…

Using a __syncthreads() between each phase of a reduction is a conservative approach that should definitely work. What I think may work is declaring the memory locations involved in the reduction as “volatile”, without using __syncthreads(). It has been a while since I read the definition of “volatile” in the C/C++ standards, but it may impose just enough restrictions on the compiler to prevent the re-ordering of the shared memory loads in this context while allowing you to save the overhead of __syncthreads(). If you go back to the code that you used as a template, are the shared memory operands in that code using the “volatile” modifier?
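For what it’s worth, the volatile variant being discussed might look like the sketch below (illustrative, not the original code). The volatile qualifier forces every shared-memory access to be performed in program order, so the loads can no longer be hoisted; note that the pattern still relies on implicit warp-synchronous execution within the final warp:

```cuda
// Final-warp min reduction over a shared array of at least 64 elements.
// "volatile" forbids the compiler from caching or re-ordering the
// shared-memory loads and stores.
__device__ void warp_reduce_min(volatile int *smem)
{
    if (threadIdx.x < 32) {
        if (smem[threadIdx.x + 32] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x + 32];
        if (smem[threadIdx.x + 16] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x + 16];
        if (smem[threadIdx.x +  8] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  8];
        if (smem[threadIdx.x +  4] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  4];
        if (smem[threadIdx.x +  2] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  2];
        if (smem[threadIdx.x +  1] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  1];
    }
}
```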

Depending on context, a strictly binary reduction may not be the best approach for very small arrays. I have in the past made good use of the following two-phase approach: Have two to four threads scan and reduce their portions of a small array, then use a single thread to combine the partial results from these threads [with a single intervening __syncthreads()].
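The two-phase approach described above might be sketched like this (a hedged illustration with four scanning threads; the names and parameters are my own, not from the thread):

```cuda
#include <climits>

// Phase 1: four threads each scan a strided portion of a small array.
// Phase 2: after a single barrier, one thread combines the partial results.
__device__ void small_array_min(const int *data, int n, int *partial, int *out)
{
    const int nscan = 4;                      // threads participating in phase 1
    if (threadIdx.x < nscan) {
        int best = INT_MAX;
        for (int i = threadIdx.x; i < n; i += nscan)
            if (data[i] < best) best = data[i];
        partial[threadIdx.x] = best;          // per-thread partial minimum
    }
    __syncthreads();                          // the single intervening barrier
    if (threadIdx.x == 0) {
        int best = partial[0];
        for (int k = 1; k < nscan; k++)
            if (partial[k] < best) best = partial[k];
        *out = best;
    }
}
```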

Declaring the shared memory as ‘volatile’ also seems to work, and seems to be a bit faster.

Thanks again…