Why is the result different when I compile in release mode?

I’m using Visual Studio 2017 and CUDA 11.1 to implement an algorithm that is very similar to a reduction. When I compile in debug mode, the result looks good. But when I compile in release mode, the calculation becomes much faster and the result sometimes becomes very strange.

I then wrapped the algorithm in a for loop and found that even in debug mode the result sometimes becomes very strange. The algorithm is something like this (it calculates the mean value of an image):

template<typename T>
__device__ void customAdd(T* sdata, T* g_odata) {
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int tid = ty * blockDim.x + tx;
	// do reduction in shared mem
	if (tid < 512) { sdata[tid] += sdata[tid + 512]; } __syncthreads();
	if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads();
	if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
	if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
	if (tid < 32) { sdata[tid] += sdata[tid + 32]; } __syncthreads();
	if (tid < 16) { sdata[tid] += sdata[tid + 16]; } __syncthreads();
	if (tid < 8) { sdata[tid] += sdata[tid + 8]; } __syncthreads();
	if (tid < 4) { sdata[tid] += sdata[tid + 4]; } __syncthreads();
	if (tid < 2) { sdata[tid] += sdata[tid + 2]; } __syncthreads();
	if (tid < 1) { sdata[tid] += sdata[tid + 1]; } __syncthreads();
	// write result for this block to global mem
	if (tid == 0) { atomicAdd(g_odata, sdata[tid]); }

}

#pragma region ProcImage

__global__ void cuda_defcan1() {
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int tid = ty * blockDim.x + tx;
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;
	if ((y >= ROW) || (x >= COL)) {
		return;
	}

	/* definite canonicalization */
	int margine = CANMARGIN / 2;
	int condition = ((x >= margine && y >= margine) &&
		(x < COL - margine) && (y < ROW - margine) &&
		d_image1[y][x] != WHITE);

	double this_pixel = condition * (double)d_image1[y][x];
	__shared__ double sdata[3][32*32];
	sdata[0][tid] = this_pixel;
	sdata[1][tid] = this_pixel * this_pixel;
	sdata[2][tid] = condition;

	__syncthreads();

	customAdd(sdata[0], d_cuda_defcan_vars);
	customAdd(sdata[1], d_cuda_defcan_vars + 1);
	customAdd(sdata[2], d_cuda_defcan_vars + 2);
}

(1) The results are “sometimes strange” how exactly?
(2) When you run your application under control of cuda-memcheck, are any issues reported?
(3) Your code uses atomicAdd() which means data is summed in undefined order. Floating-point addition is not associative, which means that the result of a summation changes when operands are added in a different order. The different timing between debug and release builds can easily cause a different summation order.

Thank you for your reply.

(1) There is a 100-iteration loop in my code; the overall code is something like

for (int i = 0; i < 100; i++)
{
    gpuReductionCode<<<grid, block>>>();

    // some other CPU code here
}

in debug mode the result is

iter = 0, new col. = 0.221465 dnn = -0.000014 var = 0.069942 (d2 = 0.000000)
iter = 1, new col. = 0.194229 dnn = 2.645678 var = 2197860438.986840 (d2 = 0.000000)
iter = 2, new col. = 0.334867 dnn = 2.599718 var = 0.063496 (d2 = 0.000000)
iter = 3, new col. = 0.584733 dnn = 2.481434 var = 0.065760 (d2 = 0.000000)
iter = 4, new col. = 0.696429 dnn = 2.461145 var = 0.072179 (d2 = 0.000000)
iter = 5, new col. = 0.698093 dnn = -0.000085 var = 0.073374 (d2 = 0.000000)

It’s not exactly the same as the CPU algorithm, but it looks fine (new col. is increasing). But when I use release mode, the result sometimes becomes

iter = 0, new col. = 0.024043 dnn = 0.010340 var = 5529828.680412 (d2 = 0.000000)
iter = 1, new col. = 0.024715 dnn = 1.956203 var = 4156.936046 (d2 = 0.000000)
iter = 2, new col. = 0.030687 dnn = 0.006405 var = 0.116142 (d2 = 0.000000)
iter = 3, new col. = 0.024536 dnn = -0.000634 var = 10833.618777 (d2 = 0.000000)
iter = 4, new col. = 0.023882 dnn = -0.002992 var = 1106415.474022 (d2 = 0.000000)
iter = 5, new col. = 0.023463 dnn = -0.021967 var = 49645.231890 (d2 = 0.000000)
iter = 6, new col. = 0.023898 dnn = 1.879580 var = 921.055858 (d2 = 0.000000)
iter = 7, new col. = 0.094352 dnn = 0.007952 var = 0.125804 (d2 = 0.000000)
iter = 8, new col. = 0.079739 dnn = -0.014220 var = 7028.755756 (d2 = 0.000000)
iter = 9, new col. = 0.053879 dnn = -0.001308 var = 2197.894398 (d2 = 0.000000)
iter = 10, new col. = 0.080534 dnn = -0.003767 var = 259711.043346 (d2 = 0.000000)
iter = 11, new col. = -nan(ind) dnn = 1.870791 var = 31314.098753 (d2 = 0.000000)
iter = 12, new col. = 0.000000 dnn = 2.332558 var = 0.126989 (d2 = 0.000000)
iter = 13, new col. = -nan(ind) dnn = 3.478238 var = 0.081687 (d2 = 0.000000)

I know something must be happening during execution, but I don’t know what it is. To find the reason, I changed my code as follows:

for (int i = 0; i < 100; i++)
{
    for (int j = 0; j < N; j++)
    {
        gpuReductionCode<<<grid, block>>>(); // use same input data
    }
    // some other CPU code here
}

And I found that even in debug mode, the result can become very strange:

iter = 0, new col. = 0.083677 dnn = 0.000084 var = 158865508.476958 (d2 = 0.000000)
iter = 1, new col. = 0.047638 dnn = 0.000071 var = 62387371.551751 (d2 = 0.000000)
iter = 2, new col. = 0.059766 dnn = -0.000085 var = 88351711.902566 (d2 = 0.000000)
iter = 3, new col. = 0.056602 dnn = 0.000630 var = 62231019.737177 (d2 = 0.000000)
iter = 4, new col. = 0.051207 dnn = -0.000047 var = 1120595.476563 (d2 = 0.000000)
iter = 5, new col. = 0.051836 dnn = 2.244428 var = 199214381.260985 (d2 = 0.000000)

(2) I enabled cuda-memcheck in Visual Studio and found a misaligned access problem, because d_cuda_defcan_vars was declared as __device__ double d_cuda_defcan_vars[3]. After I changed it to __device__ double d_cuda_defcan_vars[4], the misaligned access problem disappeared, but the result is still strange.

So, I think this issue is not the problem in my code.

Decades of experience as a software developer tell me that it is unwise to suspect an error due to third-party software and much wiser to assume an error in one’s own code.

Compiler errors do happen but are rare. A much more common scenario is that release builds expose latent bugs in code. Note that having no errors reported by cuda-memcheck is necessary but not sufficient to assume correctly working code.

Sorry, my poor English caused a misunderstanding. I don’t suspect third-party software here.

I just meant that since the misaligned access problem disappeared, misalignment is not the reason for the strange result. I know there must be some mistake in my code, but I just can’t find where it is.

Later I’ll try changing atomicAdd to a different sum-reduction kernel to find out whether atomicAdd is the cause of the strange result.

One technique used in debugging is to reduce code and data until one is left with the smallest program that reproduces the issue. Usually the root cause of a problem is easier to track down when there is very little code and data to consider.

You would want to increase observability, for example by printing out key data items (which you can collect in a log file), or by dumping important intermediate data into a buffer to be examined later.

Keep conceptual issues in mind, such as the floating-point properties and the non-deterministic behavior of summing with atomic additions I pointed out above.

It would be easier to help if you posted a more complete example that people can compile and run, ideally stripped to the bare minimum code needed to reproduce the problem.