Weird result difference between release and debug even with -fmad=false

I have recently been studying CUDA. When I use CUDA for a 2D second-partial-derivative calculation, the results differ between the release build and the debug build.

A search suggested that this could be solved with the -fmad=false option, but that does not work in this case.

The release build produces the same result as the debug build only if I reduce the expression, e.g. to tot = dxx * dxx + dyy * dyy or tot = dxx * dxx - dxy * dxy (see the code below).
So I guess the release build drops some operations, or the issue may be related to some limit on the parameters of a __global__ function.

The sample code uses 2048x2048 flattened 2D random data and picks only the data within a given width.
edit: input size 2048x2048 → 256x256

Tested environments:

Ubuntu 20.04 / CUDA 11.2.2 / NVIDIA driver 510.68 / gcc-7, gcc-10 / RTX 2070

or

Red Hat 8.3 / CUDA 11.2.2, 11.2.0, 11.3 / NVIDIA driver 460 (above 460.32) / gcc-8, gcc-10 / V100

When I test the same code with CUDA 11.1.1 or 10.1, there is no difference between release and debug.

I don’t know why there is a difference.

Code:
edit: reduced the input size and made the data platform-independent

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <fstream>

const int width = 256;
const int thread = 128;
const int vw = 128;
const int vh = 128;

__global__ void partial_d
(float* __restrict__ out, int ow, int ox1, int oy1,
 const float* __restrict__ in, int iw, int ix1, int iy1,
 int vw, int vh)
{
	// One thread per pixel of the vw x vh output window
	int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thread_idx >= vw * vh) return;

	int j = thread_idx / vw;	// row within the window
	int i = thread_idx % vw;	// column within the window

	int in_idx = (j + iy1) * iw + (i + ix1);
	int out_idx = (j + oy1) * ow + (i + ox1);

	// Central second differences. Note the 2.0 and 0.25 literals are double,
	// so parts of these expressions are evaluated in double precision and
	// rounded back to float on assignment.
	float dxx = in[in_idx + 1] + in[in_idx - 1] - in[in_idx] * 2.0;
	float dyy = in[in_idx + iw] + in[in_idx - iw] - in[in_idx] * 2.0;
	float dxy = in[in_idx + 1 + iw] + in[in_idx - 1 - iw]
		      - in[in_idx + 1 - iw] - in[in_idx - 1 + iw];
	dxy *= 0.25;

	float tot = dxx * dxx + dyy * dyy - dxy * dxy;
	 
	out[out_idx] += tot;
}

int main(int argc, char** argv)
{
	float *in, *out;

	in = (float*) malloc(sizeof(float) * width * width);
	out = (float*) malloc(sizeof(float) * width * width);

	for(int i = 0; i < width * width; i++)
	{
		// Deterministic (platform-independent) initialization; note that
		// i * i overflows 32-bit int for i >= 46341
		in[i] = (float) abs(i * i - width * width) / 10000000.f;
		out[i] = 0.f;
	}

	float *d_in, *d_out;

	cudaMalloc((void**)&d_in, sizeof(float) * width * width);
	cudaMalloc((void**)&d_out, sizeof(float) * width * width);

	cudaMemcpy(d_in, in, sizeof(float) * width * width, cudaMemcpyHostToDevice);
	cudaMemcpy(d_out, out, sizeof(float) * width * width, cudaMemcpyHostToDevice);

	int ix1, iy1, ox1, oy1;
	ix1 = 1;
	iy1 = 1;
	ox1 = 1;
	oy1 = 1;

	int block_num = (vw * vh + thread - 1) / thread;
	partial_d<<<block_num, thread>>>
		(d_out, width, ox1, oy1, 
		 d_in, width, ix1, iy1, 
		 vw, vh);
	cudaDeviceSynchronize();

	cudaMemcpy(in, d_in, sizeof(float) * width * width, cudaMemcpyDeviceToHost);
	cudaMemcpy(out, d_out, sizeof(float) * width * width, cudaMemcpyDeviceToHost);

	printf("%.28f %.28f \n",in[2062],out[2062]);

	// argv[1] names the output file (see the run script below)
	std::ofstream save(argv[1], std::ios::trunc);
	save.precision(28);
	for(int i=0; i < 10000; i++)
	{
		save << i << " " << in[i] << " " << out[i] << std::endl;
	}
	save.close();

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

Run script:

nvcc -ccbin g++-10 -fmad=false -G test.cu -o test;   # debug build (-G)
./test ~/debug.csv;
nvcc -ccbin g++-10 -fmad=false test.cu -o test;      # release build
./test ~/release.csv;

Output (debug run first, then release run):

0.4186308085918426513671875000 0.0001717955601634457707405090
0.4186308085918426513671875000 0.0001717981795081868767738342

The use of rand() to initialize the input data is unfortunate, because that means people trying to reproduce your observation on different platforms are going to see entirely different results. You may want to fix that, and also cut down the size of the input data to the minimum necessary to trigger your observation.

On a Windows 10 machine with CUDA toolchain 11.1.105 and Quadro RTX 4000, I confirmed that nvcc -o test.exe -arch=sm_75 -fmad=false test.cu (release build with FMA-merging disabled) and nvcc -o test.exe -arch=sm_75 -G test.cu (debug build) produce bit-identical results, meaning every number in the save file matches, as well as what is sent to stdout. There are differences between debug and release builds when I build without -fmad=false. This is as I would expect it and would appear to match the behavior you observed for toolchain 11.1.
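
As an aside, for checks like this it can be easier to compare raw bit patterns than to eyeball long decimal printouts. A minimal sketch of such a helper (my addition, not part of the code under discussion):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Print a float alongside its raw IEEE-754 bit pattern, so two runs can be
// compared for exact (bitwise) equality
void print_bits(float f)
{
	uint32_t bits;
	memcpy(&bits, &f, sizeof bits);   // type-pun safely via memcpy
	printf("%.9g (0x%08x)\n", f, bits);
}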

I do not have a CUDA 11.2 toolchain available for comparison purposes. The code in the kernel is simple enough and off-hand I cannot see a reason why the results from debug and release builds would not match when -fmad=false is used. Without FMA merging, the order of floating-point computations (and the additions / subtractions in particular) in this code should be fully determined by the “left to right” evaluation order of C++.
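
For what it’s worth, the effect FMA merging has on the arithmetic can be demonstrated explicitly with CUDA’s rounding-mode intrinsics, which the compiler does not contract into (or split out of) FMAs regardless of the -fmad setting. A small illustrative sketch of mine, not code from this thread:

__global__ void fmad_demo(const float* x, float* r)
{
	float a = x[0], b = x[1], c = x[2];
	// Unfused: the product is rounded to float, then the sum is rounded again
	r[0] = __fadd_rn(__fmul_rn(a, b), c);
	// Fused multiply-add: a*b is kept exact, single rounding at the end
	r[1] = __fmaf_rn(a, b, c);
	// Plain expression: with -fmad=true (the default) the compiler may emit
	// either of the above; with -fmad=false it must behave like r[0]
	r[2] = a * b + c;
}

For inputs such as a = b = 1.0f + 2^-12 and c = -1.0f, r[0] and r[1] produce different results, which is exactly the kind of discrepancy one sees between builds that do and do not merge FMAs.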

Historically, by observation, the CUDA toolchain has not applied any re-association of floating-point expressions other than FMA merging, which is often important for performance. I guess there is a possibility that this changed, either on purpose or accidentally. Have you had a chance to try your experiment with the latest CUDA version (11.7)? Best I know, NVIDIA does not provide a guarantee anywhere that floating-point results will match bit-for-bit between debug and release builds, with or without -fmad flags being specified.
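
As an aside on why re-association (if it occurred) would break bit-for-bit matching: floating-point addition is not associative. A quick host-side example, my sketch rather than anything from this code:

#include <stdio.h>

int main()
{
	float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
	// (a + b) + c == 1, but a + (b + c) == 0, because b + c rounds
	// back to -1.0e20f and the contribution of c is lost
	printf("%g %g\n", (a + b) + c, a + (b + c));
	return 0;
}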

On CUDA 11.4 / CentOS 7 / gcc 7.3.1 / Tesla V100 I don’t see any difference in the printf output. Both outputs end in 5090.

On the same setup using CUDA 11.2, I see the difference. The debug output ends in 5090 and the release output ends in 8342.

You may wish to switch to a newer CUDA version.

Thanks for the reply :)
I made the code platform-independent, reduced the input data dimensions, and removed the random initialization.

When I try this code with CUDA 11.7, there is no difference.

Thanks!

Now I know that CUDA 11.4 gives the same result.

Is this some kind of bug in a specific CUDA version?

There might be, I don’t know. I suppose it’s possible there is some other explanation. If there were a bug in that version, and you were looking for a fix, and a newer version demonstrably fixes the issue, you would be advised to switch to the newer version.

If I redefine the int variables in the __global__ function as unsigned int, the problem is solved (see the sketch below). It seems as if there is some limit related to the number of arguments or to memory…
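
A sketch of the change (only the kernel’s index variables switch to unsigned; everything else is unchanged):

...
	// workaround: same indexing as before, but with unsigned arithmetic
	unsigned int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thread_idx >= (unsigned int)(vw * vh)) return;

	unsigned int j = thread_idx / vw;
	unsigned int i = thread_idx % vw;

	unsigned int in_idx = (j + iy1) * iw + (i + ix1);
	unsigned int out_idx = (j + oy1) * ow + (i + ox1);
...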

Again, as you said, when I run this code with CUDA 11.4.1, there is no difference.

Using unsigned int instead of int can interfere with some compiler optimizations and may create enough of a “disturbance” to make the issue disappear. As the observed issue seems to be localized to CUDA 11.2, based on the data available in this thread, I would think (speculation!) that the observation was likely due to some sort of temporary bug in that version of the toolchain that has since been fixed.

One more trial:

  • Using int, int32_t, or int16_t in the __global__ function → the problem exists.

  • Using unsigned int or int64_t in the __global__ function → the problem is solved.

  • Using a shifted in_idx, even with int variables, the problem is solved:

...
	int in_idx = (j + iy1) * iw + (i + ix1) - 1;
	int out_idx = (j + oy1) * ow + (i + ox1);

	float dxx = in[in_idx + 2] + in[in_idx] - in[in_idx + 1] * 2.0;
	float dyy = in[in_idx + iw + 1] + in[in_idx - iw + 1] - in[in_idx + 1] * 2.0;
	float dxy = in[in_idx + 2 + iw] + in[in_idx - iw]
			      - in[in_idx + 2 - iw] - in[in_idx + iw];
...