Floating-point precision problems

Hey,

I’ve run into some serious floating-point problems. Here’s the situation: I have a software renderer, and to speed up pixel processing I decided to employ CUDA. My renderer is a traditional object-order renderer like OpenGL, and I even use almost the same view/projection matrices as OpenGL does. In the attached pictures you can see a dramatic difference in image quality between the CPU and the GPU (using CUDA), with the same perspective matrix, the near plane set to 0.5 and the far plane to 1000. The CPU does its job really well with 32-bit floats, whereas CUDA has really serious problems. Where do these come from? I haven’t experienced similar difficulties when I used OpenGL or Direct3D.
Attachments: gpu.png, cpu.png

It is difficult to answer this without any source code. One difference that comes to mind though is that pre-Fermi GPUs don’t support denormals and always flush to zero.
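For illustration, here is a minimal host-side sketch of what flush-to-zero means numerically (the values are arbitrary, chosen only so that the product falls into the denormal range; this is not code from the renderer):

#include <cstdio>

int main()
{
    // the true product is about 1.0e-39, which is below FLT_MIN (~1.18e-38)
    // and therefore representable only as a denormal
    float tiny = 1.0e-20f;
    float alsoTiny = 1.0e-19f;
    float product = tiny * alsoTiny;

    // with denormal support (the typical CPU default) this prints a tiny
    // non-zero value; with flush-to-zero (sm_1x, or sm_2x with -ftz=true)
    // the same multiplication yields exactly 0.0f
    printf("product = %g, (product == 0.0f) = %d\n", product, product == 0.0f);
    return 0;
}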

To be honest, I don’t think any piece of code could change anything. I execute the exact same code on both the CPU and CUDA, on the exact same types of memory (unsigned char for the color buffer, float for the depth buffer). But in case it really helps, here it is: a function that processes one pixel of a triangle (the triangle’s attributes are passed through the function’s arguments):

__global__ void rasterize(byte *colorBuffer, float *depthBuffer, [long list of other attributes])
{
	int y = minY + threadIdx.y + blockIdx.y*blockDim.y;
	int x = minX + threadIdx.x + blockIdx.x*blockDim.x;

	if (x >= minX && x <= maxX && y >= minY && y <= maxY)
	{
		// affine barycentric weights
		float a = implicitLine_device((float)x, (float)y, v1.position, v2.position) * one_over_v0ToLine12;
		float b = implicitLine_device((float)x, (float)y, v2.position, v0.position) * one_over_v1ToLine20;
		float c = implicitLine_device((float)x, (float)y, v0.position, v1.position) * one_over_v2ToLine01;

		// if pixel (x, y) is inside the triangle or on one of its edges
		if (a >= 0 && b >= 0 && c >= 0)
		{
			// this prevents the same pixel from being drawn twice (when it belongs to adjacent triangles)
			if ( ( (a == 0) && (dot12 > 0) ) ||
				 ( (b == 0) && (dot20 > 0) ) ||
				 ( (c == 0) && (dot01 > 0) ) )
			{
				return;
			}

			int pixelIndex = y*width + x;
			float z_affine = a*v0.position.z + b*v1.position.z + c*v2.position.z;

			if (z_affine < depthBuffer[pixelIndex] && z_affine <= 1.0f)
			{
				// perspective-correct barycentric weights
				float d = a*one_over_h0 + b*one_over_h1 + c*one_over_h2;
				d = 1.0f / d;
				a *= d*one_over_h0;
				b *= d*one_over_h1;
				c *= d*one_over_h2;

				// attribute interpolation
				vec3 color_persp = a*v0.color + b*v1.color + c*v2.color;
				vec2 texCoord_persp = a*v0.texCoord + b*v1.texCoord + c*v2.texCoord;

				// run pixel shader
				vec3 pixelColor = color_persp;

				// clamp bytes to 255
				byte red = (byte)(255.0f * MIN(pixelColor.x, 1.0f));
				byte green = (byte)(255.0f * MIN(pixelColor.y, 1.0f));
				byte blue = (byte)(255.0f * MIN(pixelColor.z, 1.0f));

				// update buffers
				colorBuffer[4*pixelIndex + 0] = blue;
				colorBuffer[4*pixelIndex + 1] = green;
				colorBuffer[4*pixelIndex + 2] = red;
				depthBuffer[pixelIndex] = z_affine;
			}
		}
	}
}

byte == unsigned char

You probably use the fast-math switch (-use_fast_math); use full-precision functions for division and square roots if you have them.

I don’t think I use such a thing. In the CUDA Build Rule (from the CUDA Toolkit) there is a “Use fast math” switch, which is set to “false”. However, it is available only on the “Host” page, so I don’t think it has any effect on code executed on the GPU. Here’s what the build rule produces:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe"    -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 32 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MD  " -I"..\..\..\..\include" -I"..\..\..\..\..\BlossomEngine\common\include" -I"..\..\..\..\..\BlossomEngine\math\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -maxrregcount=32  --compile -o "Release/renderer.cu.obj" renderer.cu

Try adding clamping from zero.

Could you be more specific, please? :) Do you mean some compiler option? I don’t see anything like that in the compiler’s documentation.

I mean clamping the bytes to 0…255; you only check the upper bound.
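To make the suggestion concrete, here is a sketch of two-sided clamping for the kernel above, using CUDA’s fminf()/fmaxf() device functions rather than the MIN macro from the posted code:

// clamp the interpolated color to [0, 1] before converting to bytes,
// so negative intermediate values cannot wrap around when cast to unsigned char
byte red   = (byte)(255.0f * fminf(fmaxf(pixelColor.x, 0.0f), 1.0f));
byte green = (byte)(255.0f * fminf(fmaxf(pixelColor.y, 0.0f), 1.0f));
byte blue  = (byte)(255.0f * fminf(fmaxf(pixelColor.z, 0.0f), 1.0f));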

What happens if you comment out this part?

			// this prevents the same pixel from being drawn twice (when it belongs to adjacent triangles)
			if ( ( (a == 0) && (dot12 > 0) ) ||
				 ( (b == 0) && (dot20 > 0) ) ||
				 ( (c == 0) && (dot01 > 0) ) )
			{
				return;
			}

Testing floating-point variables for exact equality is almost always the wrong thing to do. And flush-to-zero could explain why this triggers more often on the GPU than on the CPU.
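As an illustration of the point about exact equality (not code from the thread), the edge test could compare against a small tolerance instead of requiring an exact zero; the epsilon below is an arbitrary placeholder that would need tuning to the actual magnitudes of a, b, and c:

// hypothetical tolerance; the right value depends on the scale of the weights
const float EDGE_EPS = 1e-6f;

// treat the pixel as lying on an edge if the weight is merely close to zero,
// instead of requiring an exact floating-point zero
if ( ( (fabsf(a) < EDGE_EPS) && (dot12 > 0) ) ||
	 ( (fabsf(b) < EDGE_EPS) && (dot20 > 0) ) ||
	 ( (fabsf(c) < EDGE_EPS) && (dot01 > 0) ) )
{
	return;
}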

Unfortunately, it does not help. I have totally run out of ideas. I really do exactly the same things on both the CPU and the GPU. Have you not experienced floating-point problems on the GPU? Because that must be it: depth values near the far plane are all tightly packed (the depth-buffer values form a sort of 1/z curve).
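To show why the packing near the far plane matters, here is a small self-contained sketch using the standard OpenGL-style depth mapping with the near/far values mentioned above (0.5 and 1000); the formula is the textbook one, not taken from the poster’s renderer:

#include <cstdio>

// standard OpenGL-style mapping of a positive eye-space distance z to a
// depth value in [0, 1], assuming near = 0.5 and far = 1000 as in the post
float depthValue(float z, float n = 0.5f, float f = 1000.0f)
{
    // z_ndc = (f + n)/(f - n) - 2*f*n/((f - n)*z), then remapped to [0, 1]
    float zNdc = (f + n) / (f - n) - (2.0f * f * n) / ((f - n) * z);
    return 0.5f * zNdc + 0.5f;
}

int main()
{
    // distances 900 and 1000 differ by 100 units in eye space, yet their
    // depth-buffer values differ only in the 5th decimal place, so tiny
    // rounding differences can easily flip depth-test outcomes there
    printf("depth(900)  = %.7f\n", depthValue(900.0f));
    printf("depth(1000) = %.7f\n", depthValue(1000.0f));
    return 0;
}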

Try debugging your code at those points and see what is going on.

Hi,

Try replacing 0 with 0.f in the following lines:

if (a >= 0 && b >= 0 && c >= 0)

if ( ( (a == 0) && (dot12 > 0) ) || ( (b == 0) && (dot20 > 0) ) || ( (c == 0) && (dot01 > 0) ) )

It would be more helpful if you could paste the entire code, if you can!

I replaced them; no change.
How can I post more code when I have already posted the whole __global__ function? :)

The code posted does not seem to be the complete code, so by necessity the following discussion will have to remain somewhat speculative. I assume it has been established that the inputs to the function shown are identical in the CPU and GPU versions, and that data corruption (e.g. due to out of bounds accesses, race conditions) has been ruled out both by code inspection and testing.

It is not clear to me how sensitive the computation is to small numerical differences, but there are multiple reasons why intermediate GPU and CPU results may be slightly different. This in turn may cause differences in the output if such sensitivity exists. BTW, is this running on an sm_1x or sm_2x platform?

(1) On sm_1x platforms, and on sm_2x platforms with -ftz=true, very small numbers (so called denormals) are flushed to zero when running on the GPU. While such an FTZ mode also exists in SSE, it is unlikely that it is turned on by default, and the host compiler may not even target SSE. In my experience, this is an infrequent issue (with a few exceptions, code is not usually prone to frequent underflows).
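As an experiment beyond what the reply suggests, one could force the host build into a comparable flush-to-zero mode via the SSE control register and see whether the CPU image then degrades the same way; this only applies if the host compiler actually generates SSE code:

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

void matchGpuFtzBehavior()
{
    // flush results that would be denormal to zero, as sm_1x GPUs
    // (and sm_2x with -ftz=true) do
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // also treat denormal inputs as zero, where the CPU supports it
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}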

(2) On sm_1x platforms, and on sm_2x platforms with -prec-div=false, the following results in an approximate, not IEEE-rounded reciprocal:

d = 1.0f / d;

On the CPU, an IEEE-rounded reciprocal would likely be generated here. For the sake of experiment, you could replace this as follows to force a reciprocal with IEEE round to-nearest-or-even on the GPU:

d = __frcp_rn(d);

(3) There are multiple opportunities in this code for the compiler to generate either FMADs (on sm_1x), or FFMAs (on sm_2x), which are merged multiply-and-add operations. For example, the following expression naturally lends itself to the use of FMAD/FMA:

vec3 color_persp = a*v0.color + b*v1.color + c*v2.color;

The results from either FMAD or FFMA are not generally identical to a sequence of FMUL followed by FADD, as would be used on the CPU. FFMA on sm_2x follows the IEEE-754 (2008) specification for fused multiply-add and will in general provide better accuracy than the combination of FMUL and FADD. To force the use of separate FMUL and FADD, one can use the __fmul_rn(), __fadd_rn() device functions, the use of which prevents the compiler from merging these operations into FMAD/FFMA.
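For example, the scalar z_affine interpolation from the posted kernel could be rewritten with these intrinsics as follows (a sketch of the suggested experiment, not code from the thread):

// force separate FMUL/FADD with round-to-nearest-or-even, preventing the
// compiler from contracting these operations into FMAD (sm_1x) or FFMA (sm_2x)
float z_affine = __fadd_rn(__fadd_rn(__fmul_rn(a, v0.position.z),
                                     __fmul_rn(b, v1.position.z)),
                           __fmul_rn(c, v2.position.z));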

(4) Even though the code uses float variables throughout, depending on how it is compiled on the CPU, much of the CPU computation may actually be performed at higher precision. In particular this would likely be the case if the compiler targets the x87 FPU [i.e. not SSE], in which case most floating-point operations would map to in-register computation that is performed with double-precision (on Windows) or extended-precision (on Linux).

While there are host compiler flags designed to reduce “excess precision” being carried in temporary variables (I think /Op for MSVC, -ffloat-store for gcc), the only way to test that hypothesis rigorously on the CPU is to declare all float variables volatile, and also break up the computations into individual operations. For example, this

float z_affine = a*v0.position.z + b*v1.position.z + c*v2.position.z;

would be turned into this:

volatile float z_affine;
z_affine  = a*v0.position.z;
z_affine += b*v1.position.z;
z_affine += c*v2.position.z;

In my personal experience, significant numerical differences observed between CPU and GPU computation on floats are often due to issue (4), but combinations of multiple of the issues mentioned above are also not uncommon.

Wow, great explanations here. Thanks for taking the time to summarize all this.