Weird result difference between release and debug even with -fmad=false

I have recently been studying CUDA. When I use CUDA for a 2D second-partial-derivative calculation, the results differ between the release build and the debug build.

A search suggested that this could be solved with the -fmad=false option, but that does not work in this case.

The release build produces the same result as the debug build only if I reduce the expression, e.g. to tot = dxx * dxx + dyy * dyy or tot = dxx * dxx - dxy * dxy (see the code below).
So I guess the release build drops some operations, or the issue may be related to some limit on the parameters of a __global__ function.

The sample code uses 2048x2048 flattened 2D random data and picks only the data within a given width.
edit: input size 2048x2048 → 256x256

Tested environments:

Ubuntu 20.04 / CUDA 11.2.2 / NVIDIA driver 510.68 / gcc-7, gcc-10 / RTX 2070

or

Red Hat 8.3 / CUDA 11.2.2, 11.2.0, 11.3 / NVIDIA driver 460 (above 460.32) / gcc-8, gcc-10 / V100

When I test the same code with CUDA 11.1.1 or 10.1, there is no difference between release and debug.

I don’t know why there is a difference.

Code:
edit: reduced the input size and made the data platform-independent

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <fstream>

const int width = 256;
const int thread = 128;
const int vw = 128;
const int vh = 128;

__global__ void partial_d
(float* __restrict__ out, int ow, int ox1, int oy1,
 const float* __restrict__ in, int iw, int ix1, int iy1,
 int vw, int vh)
{
	// One thread per pixel of the vw x vh output window
	int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thread_idx >= vw * vh) return;

	int j = thread_idx / vw;	// row within the window
	int i = thread_idx % vw;	// column within the window

	int in_idx = (j + iy1) * iw + (i + ix1);
	int out_idx = (j + oy1) * ow + (i + ox1);

	// Central second differences. Note the 2.0 and 0.25 literals are double,
	// so parts of these expressions are evaluated in double precision and
	// rounded back to float on assignment.
	float dxx = in[in_idx + 1] + in[in_idx - 1] - in[in_idx] * 2.0;
	float dyy = in[in_idx + iw] + in[in_idx - iw] - in[in_idx] * 2.0;
	float dxy = in[in_idx + 1 + iw] + in[in_idx - 1 - iw]
		      - in[in_idx + 1 - iw] - in[in_idx - 1 + iw];
	dxy *= 0.25;

	float tot = dxx * dxx + dyy * dyy - dxy * dxy;
	 
	out[out_idx] += tot;
}

int main(int argc, char** argv)
{
	float *in, *out;

	in = (float*) malloc(sizeof(float) * width * width);
	out = (float*) malloc(sizeof(float) * width * width);

	for(int i = 0; i < width * width; i++)
	{
		// Deterministic (platform-independent) initialization; note that
		// i * i overflows 32-bit int for i >= 46341
		in[i] = (float) abs(i * i - width * width) / 10000000.f;
		out[i] = 0.f;
	}

	float *d_in, *d_out;

	cudaMalloc((void**)&d_in, sizeof(float) * width * width);
	cudaMalloc((void**)&d_out, sizeof(float) * width * width);

	cudaMemcpy(d_in, in, sizeof(float) * width * width, cudaMemcpyHostToDevice);
	cudaMemcpy(d_out, out, sizeof(float) * width * width, cudaMemcpyHostToDevice);

	int ix1, iy1, ox1, oy1;
	ix1 = 1;
	iy1 = 1;
	ox1 = 1;
	oy1 = 1;

	int block_num = (vw * vh + thread - 1) / thread;
	partial_d<<<block_num, thread>>>
		(d_out, width, ox1, oy1, 
		 d_in, width, ix1, iy1, 
		 vw, vh);
	cudaDeviceSynchronize();

	cudaMemcpy(in, d_in, sizeof(float) * width * width, cudaMemcpyDeviceToHost);
	cudaMemcpy(out, d_out, sizeof(float) * width * width, cudaMemcpyDeviceToHost);

	printf("%.28f %.28f \n",in[2062],out[2062]);

	// argv[1] names the output file (see the run script below)
	std::ofstream save(argv[1], std::ios::trunc);
	save.precision(28);
	for(int i=0; i < 10000; i++)
	{
		save << i << " " << in[i] << " " << out[i] << std::endl;
	}
	save.close();

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

Run script:

nvcc -ccbin g++-10 -fmad=false -G test.cu -o test;   # debug build (-G)
./test ~/debug.csv;
nvcc -ccbin g++-10 -fmad=false test.cu -o test;      # release build
./test ~/release.csv;

Output (debug run first, then release run):

0.4186308085918426513671875000 0.0001717955601634457707405090
0.4186308085918426513671875000 0.0001717981795081868767738342

The use of rand() to initialize the input data is unfortunate, because that means people trying to reproduce your observation on different platforms are going to see entirely different results. You may want to fix that, and also cut down the size of the input data to the minimum necessary to trigger your observation.

On a Windows 10 machine with CUDA toolchain 11.1.105 and Quadro RTX 4000, I confirmed that nvcc -o test.exe -arch=sm_75 -fmad=false test.cu (release build with FMA-merging disabled) and nvcc -o test.exe -arch=sm_75 -G test.cu (debug build) produce bit-identical results, meaning every number in the save file matches, as well as what is sent to stdout. There are differences between debug and release builds when I build without -fmad=false. This is as I would expect it and would appear to match the behavior you observed for toolchain 11.1.
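
As an aside, for checks like this it can be easier to compare raw bit patterns than to eyeball long decimal printouts. A minimal sketch of such a helper (my addition, not part of the code under discussion):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Print a float alongside its raw IEEE-754 bit pattern, so two runs can be
// compared for exact (bitwise) equality
void print_bits(float f)
{
	uint32_t bits;
	memcpy(&bits, &f, sizeof bits);   // type-pun safely via memcpy
	printf("%.9g (0x%08x)\n", f, bits);
}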

I do not have a CUDA 11.2 toolchain available for comparison purposes. The code in the kernel is simple enough and off-hand I cannot see a reason why the results from debug and release builds would not match when -fmad=false is used. Without FMA merging, the order of floating-point computations (and the additions / subtractions in particular) in this code should be fully determined by the “left to right” evaluation order of C++.
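
For what it’s worth, the effect FMA merging has on the arithmetic can be demonstrated explicitly with CUDA’s rounding-mode intrinsics, which the compiler does not contract into (or split out of) FMAs regardless of the -fmad setting. A small illustrative sketch of mine, not code from this thread:

__global__ void fmad_demo(const float* x, float* r)
{
	float a = x[0], b = x[1], c = x[2];
	// Unfused: the product is rounded to float, then the sum is rounded again
	r[0] = __fadd_rn(__fmul_rn(a, b), c);
	// Fused multiply-add: a*b is kept exact, single rounding at the end
	r[1] = __fmaf_rn(a, b, c);
	// Plain expression: with -fmad=true (the default) the compiler may emit
	// either of the above; with -fmad=false it must behave like r[0]
	r[2] = a * b + c;
}

For inputs such as a = b = 1.0f + 2^-12 and c = -1.0f, r[0] and r[1] produce different results, which is exactly the kind of discrepancy one sees between builds that do and do not merge FMAs.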

Historically, by observation, the CUDA toolchain has not applied any re-association of floating-point expressions other than FMA merging, which is often important for performance. I guess there is a possibility that this changed, either on purpose or accidentally. Have you had a chance to try your experiment with the latest CUDA version (11.7)? Best I know, NVIDIA does not provide a guarantee anywhere that floating-point results will match bit-for-bit between debug and release builds, with or without -fmad flags being specified.
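
As an aside on why re-association (if it occurred) would break bit-for-bit matching: floating-point addition is not associative. A quick host-side example, my sketch rather than anything from this code:

#include <stdio.h>

int main()
{
	float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
	// (a + b) + c == 1, but a + (b + c) == 0, because b + c rounds
	// back to -1.0e20f and the contribution of c is lost
	printf("%g %g\n", (a + b) + c, a + (b + c));
	return 0;
}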

On CUDA 11.4 / CentOS 7 / gcc 7.3.1 / Tesla V100 I don’t see any difference in the printf output. Both outputs end in 5090.

On the same setup using CUDA 11.2, I see the difference. The debug output ends in 5090 and the release output ends in 8342.

You may wish to switch to a newer CUDA version.

Thanks for the reply :)
I made the code platform-independent, reduced the input data dimensions, and removed the random initialization.

When I try this code with CUDA 11.7, there is no difference.

Thanks!

Now I know that CUDA 11.4 gives the same result.

Is this some kind of bug in a specific CUDA version?

There might be, I don’t know. I suppose it’s possible there is some other explanation. If there were a bug in that version, and you were looking for a fix, and a newer version demonstrably fixes the issue, you would be advised to switch to the newer version.

If I redefine the int variables in the __global__ function as unsigned int, the problem is solved (see the sketch below). It seems as if there is some limit related to the number of arguments or to memory…
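
A sketch of the change (only the kernel’s index variables switch to unsigned; everything else is unchanged):

...
	// workaround: same indexing as before, but with unsigned arithmetic
	unsigned int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(thread_idx >= (unsigned int)(vw * vh)) return;

	unsigned int j = thread_idx / vw;
	unsigned int i = thread_idx % vw;

	unsigned int in_idx = (j + iy1) * iw + (i + ix1);
	unsigned int out_idx = (j + oy1) * ow + (i + ox1);
...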

Again, as you said, when I run this code with CUDA 11.4.1, there is no difference.

Using unsigned int instead of int can interfere with some compiler optimizations and may create enough of a “disturbance” to make the issue disappear. As the observed issue seems to be localized to CUDA 11.2, based on the data available in this thread, I would think (speculation!) that the observation was likely due to some sort of temporary bug in that version of the toolchain that has since been fixed.

One more trial:

  • Using int, int32_t, or int16_t in the __global__ function → the problem exists.

  • Using unsigned int or int64_t in the __global__ function → the problem is solved.

  • Using a shifted in_idx, even with int variables, the problem is solved:

...
	int in_idx = (j + iy1) * iw + (i + ix1) - 1;
	int out_idx = (j + oy1) * ow + (i + ox1);

	float dxx = in[in_idx + 2] + in[in_idx] - in[in_idx + 1] * 2.0;
	float dyy = in[in_idx + iw + 1] + in[in_idx - iw + 1] - in[in_idx + 1] * 2.0;
	float dxy = in[in_idx + 2 + iw] + in[in_idx - iw]
			      - in[in_idx + 2 - iw] - in[in_idx + iw];
...