Different results when using GPU debug option -G

I had a problem running Thrust while the -G flag was set, even though some features would work just fine.

When I build in the debug configuration but without the -G flag, the GPU code does run faster, but on closer examination I found that in this mode the non-Thrust reduction/scan kernels produce different answers.

I did look at the CUDA Compiler Driver nvcc documentation:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc

but that did not explain this issue.

It is almost as if the __syncthreads() calls within the kernels are being ignored without the -G flag.

How specifically does this flag affect the code generation and execution?

This statement suggests it is a bit more complicated…

–device-debug (-G)
Generate debug information for device code, plus also specify the optimization
level for the device code in order to control its 'debuggability'.
Allowed values for this option: 0,1,2,3.

Have you had a chance to check the “non-Thrust code” that produces the different results in debug and release builds for out-of-bounds accesses and race conditions with cuda-memcheck? Do any of the reductions in question involve floating-point arithmetic?

As far as I am aware, the compiler needs to turn off pretty much all optimizations for debug builds, so the resulting machine code can be very different from default (= fully optimized) builds.

I also assumed that I had made a mistake, so I took Jimmy Pettersson’s min-max reduction code (slightly modified for my system, but the few changes only concerned the timing functions).

Since his code is usually correct, I ran it multiple times with the -G flag and without. Without the flag the reduction answers were usually wrong, while with the flag the results were correct.

Here is a Pastebin link to the exact code, which only works correctly with -G:

http://pastebin.com/sSmEn0uq

My own reduction code was simpler and was likewise incorrect without the -G flag. Maybe there is something I am missing, but I checked for memory errors and used NVIDIA Nsight, which indicated no obvious problems.

I can post more examples, but I used this one because I did not write the reduction code myself.

When I use only atomics and no shared memory, the -G flag makes no difference.
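For contrast, the atomics-only pattern mentioned here can be sketched as below (names are illustrative, and an int reduction is assumed because atomicMin() has no native float overload). Every thread folds its element straight into a single global result, so there is no shared-memory ordering for the compiler to disturb:

```cuda
#include <climits>

// Atomics-only min reduction: no shared memory and no __syncthreads(),
// so -G vs. release makes no difference to correctness.
__global__ void atomic_min_kernel(const int *in, int n, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMin(result, in[i]);   // hardware-serialized read-modify-write
}
// Host side: initialize *result to INT_MAX before the launch.
```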

It is only when using shared memory with __syncthreads() that these -G problems arise.

I can paste in the NVIDIA samples’ reduction code verbatim and see the incorrect results without the -G flag.

Also, the NVIDIA SDK sample projects are configured with the -G flag on, so when they run with this default configuration the answers are correct.

Sorry, that’s too much code for me to look at in detail. Does declaring the shared memory locations used in the reductions as volatile make any difference?

From what I have seen, problems with reductions through shared memory arise because programmers tend to implicitly rely on load-ordering behavior that is not guaranteed. In that regard, the warp_reduce_min() and warp_reduce_max() functions look suspicious to me. I see nothing that prevents the compiler from re-arranging the code as follows (e.g. to schedule loads early to cover latency in the optimized build):

tempreg1 = smem[threadIdx.x+32];
tempreg2 = smem[threadIdx.x+16];
tempreg3 = smem[threadIdx.x+8];
tempreg4 = smem[threadIdx.x+4];
tempreg5 = smem[threadIdx.x+2];
tempreg6 = smem[threadIdx.x+1];
tempreg7 = smem[threadIdx.x];
smem[threadIdx.x] = tempreg1 < tempreg7 ? tempreg1 : tempreg7;
smem[threadIdx.x] = tempreg2 < tempreg7 ? tempreg2 : tempreg7;
[…]

Clearly, such load re-ordering would destroy the desired behavior of the code. Since none of the smem[threadIdx.x+n] locations are written to, and they are not aliased to each other (common base pointer with mutually exclusive offsets), loads from these locations are all independent of each other as far as the compiler is concerned, and therefore can be re-ordered. As a related experiment, what happens when you insert __syncthreads() between consecutive lines in these two functions? This is one way of enforcing the desired order, as the compiler will not move a shared memory load across the barrier.
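The experiment suggested above can be sketched as follows; this is a conservative, hedged illustration rather than the original Pastebin code, written as a full-block loop so that every thread in the block reaches each __syncthreads(), as CUDA requires:

```cuda
// Conservative block-wide min reduction: a barrier after every step makes each
// shared-memory store visible before the next load, so the compiler cannot
// hoist loads across steps. Assumes blockDim.x is a power of two and smem[]
// holds blockDim.x elements.
__device__ void block_reduce_min(int *smem)
{
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset) {
            int other = smem[threadIdx.x + offset];
            if (other < smem[threadIdx.x])
                smem[threadIdx.x] = other;
        }
        __syncthreads();   // reached by all threads, including inactive ones
    }
    // smem[0] now holds the block minimum
}
```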

Bingo. That was it (inserting the __syncthreads() statements in the device warp_reduce functions).

What happened is that I used that exact min-max sample code as a template for my other reductions, assuming it was right because I had not written it and because it was running correctly in -G mode.

I had actually wondered about that ordering issue, but thought that the __syncthreads() statements were not required in the last warp.

I was really confused by that. Thank you…

Using a __syncthreads() between each phase of a reduction is a conservative approach that should definitely work. What I think may work is declaring the memory locations involved in the reduction as “volatile”, without using __syncthreads(). It has been a while since I read the definition of “volatile” in the C/C++ standards, but it may impose just enough restrictions on the compiler to prevent the re-ordering of the shared memory loads in this context while allowing you to save the overhead of __syncthreads(). If you go back to the code that you used as a template, are the shared memory operands in that code using the “volatile” modifier?
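For what it’s worth, the volatile variant being discussed might look like the sketch below (illustrative, not the original code). The volatile qualifier forces every shared-memory access to be performed in program order, so the loads can no longer be hoisted; note that the pattern still relies on implicit warp-synchronous execution within the final warp:

```cuda
// Final-warp min reduction over a shared array of at least 64 elements.
// "volatile" forbids the compiler from caching or re-ordering the
// shared-memory loads and stores.
__device__ void warp_reduce_min(volatile int *smem)
{
    if (threadIdx.x < 32) {
        if (smem[threadIdx.x + 32] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x + 32];
        if (smem[threadIdx.x + 16] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x + 16];
        if (smem[threadIdx.x +  8] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  8];
        if (smem[threadIdx.x +  4] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  4];
        if (smem[threadIdx.x +  2] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  2];
        if (smem[threadIdx.x +  1] < smem[threadIdx.x]) smem[threadIdx.x] = smem[threadIdx.x +  1];
    }
}
```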

Depending on context, a strictly binary reduction may not be the best approach for very small arrays. I have in the past made good use of the following two-phase approach: Have two to four threads scan and reduce their portions of a small array, then use a single thread to combine the partial results from these threads [with a single intervening __syncthreads()].
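The two-phase approach described above might be sketched like this (a hedged illustration with four scanning threads; the names and parameters are my own, not from the thread):

```cuda
#include <climits>

// Phase 1: four threads each scan a strided portion of a small array.
// Phase 2: after a single barrier, one thread combines the partial results.
__device__ void small_array_min(const int *data, int n, int *partial, int *out)
{
    const int nscan = 4;                      // threads participating in phase 1
    if (threadIdx.x < nscan) {
        int best = INT_MAX;
        for (int i = threadIdx.x; i < n; i += nscan)
            if (data[i] < best) best = data[i];
        partial[threadIdx.x] = best;          // per-thread partial minimum
    }
    __syncthreads();                          // the single intervening barrier
    if (threadIdx.x == 0) {
        int best = partial[0];
        for (int k = 1; k < nscan; k++)
            if (partial[k] < best) best = partial[k];
        *out = best;
    }
}
```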

Declaring the shared memory as ‘volatile’ also seems to work, and seems to be a bit faster.

Thanks again…