Performance and computation bug related to floating point

I have a very strange issue with a program I wrote. It is supposed to be a parametric surface raytracer (not finished yet). The way I wrote it, it is completely deterministic: each frame should be identical if the parameters are the same. But for some inputs, namely Chmutov's surface (defined with Chebyshev polynomials, see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.9031&rep=rep1&type=pdf), the results are nondeterministic. The artifacts do not occur with other surfaces.
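For reference, the surface in question is the degree-n Chmutov surface, given implicitly by T_n(x) + T_n(y) + T_n(z) = 0, where T_n is the n-th Chebyshev polynomial of the first kind. A minimal host-side sketch of the math being evaluated (not my kernel code, and the function names here are made up just for illustration):

float chebyshev(int n, float x)
{
  // Three-term recurrence: T_0 = 1, T_1 = x, T_i = 2*x*T_{i-1} - T_{i-2}.
  float t0 = 1.0f, t1 = x;
  if (n == 0) return t0;
  for (int i = 2; i <= n; ++i)
    {
      float t2 = 2.0f * x * t1 - t0;
      t0 = t1;
      t1 = t2;
    }
  return t1;
}

float chmutov(int n, float x, float y, float z)
{
  // A point (x, y, z) lies on the surface when this expression is zero.
  return chebyshev(n, x) + chebyshev(n, y) + chebyshev(n, z);
}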

Here are two screenshots taken from the program. The input values were exactly the same:

[screenshot 1]

[screenshot 2]

Perhaps such nondeterminism is expected, but personally I haven't come across any documentation saying so.

To make things even weirder, this behaviour is correlated with code performance. This little table sums up my experiments:

 #  | gencode opts | compiler options                 | FPS | effects
----+--------------+-----------------------------------+-----+-----------------------------------------------
 1  | sm_21        | -use_fast_math                    |  32 | very bad visuals
 2  | sm_21        | ftz=1 prec-sqrt=1                 |  41 | same as above
 3  | sm_21        | ftz=0 prec-sqrt=1                 | 187 | mostly fine, minor artifacts
 4  | sm_21        | (none)                            | 187 | same as above
 5  | sm_21        | -use_fast_math ftz=0 prec-sqrt=1  | 403 | lots of artifacts, but not as many as in #2
 6  | (none given) | (none)                            | 298 | at first the view is perfect, but moving around shows a small number of mostly steady artifacts (only small variations on the screen, no huge black patches)
 7  | (none given) | ftz=1 prec-sqrt=1                 | 287 | same as above
 8  | (none given) | ftz=0 prec-sqrt=1                 | 282 | same as above
 9  | (none given) | -use_fast_math                    | 193 | ** perfect render ** - no artifacts anywhere
 10 | (none given) | -use_fast_math ftz=0 prec-sqrt=1  | 374 | another ** perfect render ** combination - also note the high framerate

This whole thing is arcane. I thought that compiling for sm_21 would give the same results as leaving the compilation up to the JIT, but obviously I was mistaken. It seems that -use_fast_math without any gencode options yields the best and most correct code, which is strange.
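To be explicit about what the rows mean, the only thing changing between builds are nvcc flags along these lines (file names here are just placeholders, not my actual makefile):

nvcc -gencode arch=compute_20,code=sm_21 -use_fast_math -c raytrace.cu -o raytrace.o   # like row #1
nvcc -use_fast_math -c raytrace.cu -o raytrace.o                                       # like row #9 (perfect render)

The rows without a gencode option leave the final code generation to the driver JIT, as mentioned above.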

Below is my system setup. I did, however, test the code on other machines with different graphics cards, and the artifacts were still there.

OS: Arch Linux, 64 bit, kernel 2.6.36-ARCH

CPU: i7 870 @ 2.93GHz

Graphics driver: NVIDIA UNIX x86_64 Kernel Module  260.19.36

Graphics card: nVidia Corporation GF104 [GeForce GTX 460] (rev a1)

Memory (I can’t remember the exact hardware now, sorry):

# dmidecode 2.11

SMBIOS 2.6 present.

Handle 0x0008, DMI type 5, 24 bytes

Memory Controller Information

	Error Detecting Method: 64-bit ECC

	Error Correcting Capabilities:

		None

	Supported Interleave: One-way Interleave

	Current Interleave: One-way Interleave

	Maximum Memory Module Size: 2048 MB

	Maximum Total Memory Size: 8192 MB

	Supported Speeds:

		Other

	Supported Memory Types:

		DIMM

		SDRAM

	Memory Module Voltage: 3.3 V

	Associated Memory Slots: 4

		0x0009

		0x000A

		0x000B

		0x000C

	Enabled Error Correcting Capabilities:

		None

Handle 0x0009, DMI type 6, 12 bytes

Memory Module Information

	Socket Designation: DIMM0

	Bank Connections: 0 1

	Current Speed: Unknown

	Type: DIMM SDRAM

	Installed Size: 2048 MB (Double-bank Connection)

	Enabled Size: 2048 MB (Double-bank Connection)

	Error Status: OK

Handle 0x000A, DMI type 6, 12 bytes

Memory Module Information

	Socket Designation: DIMM1

	Bank Connections: 2 3

	Current Speed: Unknown

	Type: DIMM SDRAM

	Installed Size: 2048 MB (Double-bank Connection)

	Enabled Size: 2048 MB (Double-bank Connection)

	Error Status: OK

Handle 0x000B, DMI type 6, 12 bytes

Memory Module Information

	Socket Designation: DIMM2

	Bank Connections: 4 5

	Current Speed: Unknown

	Type: DIMM SDRAM

	Installed Size: 2048 MB (Double-bank Connection)

	Enabled Size: 2048 MB (Double-bank Connection)

	Error Status: OK

Handle 0x000C, DMI type 6, 12 bytes

Memory Module Information

	Socket Designation: DIMM3

	Bank Connections: 6 7

	Current Speed: Unknown

	Type: DIMM SDRAM

	Installed Size: 2048 MB (Double-bank Connection)

	Enabled Size: 2048 MB (Double-bank Connection)

	Error Status: OK

There is no way I can provide an isolated test case; the whole program is way too complex. But I can provide the full source code if you wish.

Do you use atomic operations anywhere in your code?

Any other places where different threads access the same memory?

Uninitialized memory being used? Or a race condition (read-after-write without proper synchronization, for example from/to shared memory)?
Since the bug is affected by the optimization flags, I'd guess the latter (real-world timing changes).
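For example, a read-after-write hazard through shared memory looks roughly like this (a generic illustration, not code from your project):

__global__ void blur_row(const float * in, float * out)
{
  __shared__ float tile[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];
  // Without this barrier the read below can happen before the neighbouring
  // thread has written its slot, so the result depends on warp timing.
  __syncthreads();
  float right = tile[(threadIdx.x + 1) % blockDim.x];
  out[i] = 0.5f * (tile[threadIdx.x] + right);
}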

Atomics may be used by the Thrust library, but I doubt it; I don't use them myself. Furthermore, the artifacts also appear on devices that don't support atomics, such as a GeForce 8600 GS (mobile version).

The memory in question is a Pixel Buffer Object mapped with the appropriate calls. I can't tell what was in it before the kernel runs, but I do write to every single cell. The launches themselves are pretty simple:

extern "C" void launch_raytrace_kernel(uint * pbo, View view, int w, int h)

{

  std::cerr << "w=" << w << std::endl

            << "h=" << h << std::endl; 

PrintView( view );

for(int ix_h = 0; ix_h < h; ix_h++)

    {

      thrust::transform( thrust::make_counting_iterator< short >(0),

                         thrust::make_counting_iterator< short >(w),

                         thrust::device_ptr< uint >(pbo + h * ix_h),

                         TracePoint(w,h,ix_h,

                                    view));

      // 1

    }

  // 2

}

As to the race conditions: I normally run two (CPU) threads. One is used for on-line parameter configuration; the other does the actual work. Disabling the first one doesn't help. I also tried inserting cudaThreadSynchronize() at // 1 and // 2 above, but it had no effect other than decreasing the FPS.
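Concretely, the synchronization experiment looked roughly like this (same loop as above, just with the extra calls at the two marked spots):

for(int ix_h = 0; ix_h < h; ix_h++)
  {
    thrust::transform( thrust::make_counting_iterator< short >(0),
                       thrust::make_counting_iterator< short >(w),
                       thrust::device_ptr< uint >(pbo + h * ix_h),
                       TracePoint(w, h, ix_h, view) );
    cudaThreadSynchronize(); // 1 - wait for this row before launching the next
  }
cudaThreadSynchronize();     // 2 - wait for everything before returning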

I've put all the code related to this on GitHub, in the Tener/cuda-course repository (code for the "parallel programming on CUDA" university course). The problematic program is in the folder "pracownia/projekt-cz2". The code is rather portable as it is now, except for the pthread library used in one place; you can comment that part out and it will still work.

The easiest way to compile it is to grab a Linux box, install the required libraries (GLUT, GLFW, boost_system, boost_thread), fix the makefile so it links against the right libraries (Boost libraries have notorious naming issues; they are named differently in every distribution), and type "make clean && make".

(premature post)

So, I found the bug. It was indeed uninitialized memory in this case, but in a somewhat tricky way. Just see for yourself:

template <int N>
struct Chebyshev_Pol
{
  __host__ __device__
  static float calculate(float x)
  {
    float arr[N+1];
    arr[0] = 1;
    arr[1] = x;
#pragma unroll 16
    for(unsigned int i = 2; i < N+1; i++)
      {
        arr[i] = 2 * x * arr[i-1] - arr[i-2];
        // This line used to look like this:
        //   arr[i] = 2 * x * arr[N-1] - arr[N-2];  <-- 'N' instead of 'i'
        // So in effect the kernel was always reading the two last elements
        // of the array (still uninitialized on early iterations) instead of
        // the consecutive ones.
      }
    return arr[N];
  }
};
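For completeness, this template is what the Chmutov surface test is built on, roughly like this (the degree and the wrapper name here are only for illustration; the real code is in the repository linked above):

__host__ __device__
inline float chmutov_degree16(float x, float y, float z)
{
  // Zero iff (x, y, z) lies on the degree-16 Chmutov surface.
  return Chebyshev_Pol<16>::calculate(x)
       + Chebyshev_Pol<16>::calculate(y)
       + Chebyshev_Pol<16>::calculate(z);
}

With the buggy line, early iterations read arr[N-1] and arr[N-2] before anything had been stored there, so the result depended on whatever happened to sit in those registers or local-memory slots. That is exactly the kind of thing that shifts with different optimization flags, which explains the correlation in the table above.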