The posted code does not appear to be complete, so by necessity the following discussion will have to remain somewhat speculative. I assume it has been established that the inputs to the function shown are identical in the CPU and GPU versions, and that data corruption (e.g. due to out-of-bounds accesses, race conditions) has been ruled out both by code inspection and testing.
It is not clear to me how sensitive the computation is to small numerical differences, but there are multiple reasons why intermediate GPU and CPU results may be slightly different. This in turn may cause differences in the output if such sensitivity exists. BTW, is this running on an sm_1x or sm_2x platform?
(1) On sm_1x platforms, and on sm_2x platforms with -ftz=true, very small numbers (so-called denormals) are flushed to zero when running on the GPU. While such an FTZ mode also exists in SSE, it is unlikely to be turned on by default, and the host compiler may not even target SSE. In my experience this is an infrequent issue; with a few exceptions, code is not usually prone to frequent underflows.
(2) On sm_1x platforms, and on sm_2x platforms with -prec-div=false, the following results in an approximate, not IEEE-rounded reciprocal:
d = 1.0f / d;
On the CPU, an IEEE-rounded reciprocal would likely be generated here. For the sake of experiment, you could replace this as follows to force a reciprocal with IEEE round-to-nearest-or-even on the GPU:
d = __frcp_rn(d);
(3) There are multiple opportunities in this code for the compiler to generate either FMADs (on sm_1x), or FFMAs (on sm_2x), which are merged multiply-and-add operations. For example, the following expression naturally lends itself to the use of FMAD/FMA:
vec3 color_persp = a*v0.color + b*v1.color + c*v2.color;
The results from either FMAD or FFMA are not generally identical to a sequence of FMUL followed by FADD, as would be used on the CPU. FFMA on sm_2x follows the IEEE-754 (2008) specification for fused multiply-add and will in general provide better accuracy than the combination of FMUL and FADD. To force the use of separate FMUL and FADD, one can use the __fmul_rn(), __fadd_rn() device functions, the use of which prevents the compiler from merging these operations into FMAD/FFMA.
(4) Even though the code uses float variables throughout, depending on how it is compiled on the CPU, much of the CPU computation may actually be performed at higher precision. In particular this would likely be the case if the compiler targets the x87 FPU [i.e. not SSE], in which case most floating-point operations would map to in-register computation that is performed with double-precision (on Windows) or extended-precision (on Linux).
While there are host compiler flags designed to reduce “excess precision” being carried in temporary variables (I think /Op for MSVC, -ffloat-store for gcc), the only way to test that hypothesis rigorously on the CPU is to declare all float variables volatile, and also break up the computations into individual operations. For example, this
float z_affine = a*v0.position.z + b*v1.position.z + c*v2.position.z;
would be turned into this:
volatile float z_affine;
z_affine = a*v0.position.z;
z_affine += b*v1.position.z;
z_affine += c*v2.position.z;
In my personal experience, significant numerical differences observed between CPU and GPU computation on floats are often due to issue (4), but combinations of multiple of the issues mentioned above are also not uncommon.