CUDA 3.2 results different from 3.1

Hello,

I’m running a rather involved MC simulation using CUDA, and upon updating from 3.1 to 3.2, I noticed a slight but palpable difference in the results. I’m simply recompiling and rerunning the same code under the two CUDA versions (3.1 and 3.2). Are there any known computational differences or enhancements in 3.2 that affect the machine arithmetic?

Any help is appreciated. Thanks much, Joe

Just confirmed: CUDA 3.1 results are replicated on the CPU, while, in some instances, CUDA 3.2 results deviate from both the CPU and 3.1. Therefore, I claim CUDA 3.2 is buggy! I suspect the error comes from one of the machine float math functions. I will need more time to trace it to its origin.

Hard to imagine my case is special. Has anyone else encountered a similar problem?

Joe Fatmama,

Could be that the Toolkit has changed a bit… probably it’s using some new instructions…

You may need to read the release notes to see if you need to add some compiler options to disable certain optimizations.

Compare the PTX to know the difference.

What driver are you using ?

I found some code lost functionality going from driver 256.35 to 260.19.21, where the CUDA toolkit remained at 3.1 in both cases (going to 3.2 did not help the issue on 260.19.21).

Using the 260.99 driver at the moment. Haven’t had the time to see what the problem is. I did observe similarly odd behavior going from an older driver to a 260.xx or 259.xx; if I recall correctly, it was from 257.xx to 260.xx. It appears that my RNG logic involving int64 operations is somehow processed differently. That is, certain seeds are consistently affected by the change in CUDA version/driver. I can currently replicate the 3.1/260.99 output on a CPU. It could be that some conversions from long to uint64, or some bit operations, are now handled differently, or that this issue was cleaned up in past driver updates but has now resurfaced in 3.2. Testing this out is also a headache, because with a single machine I am forced to keep switching between 3.1 and 3.2 each time I want to compare output. So, I’m secretly hoping nvidia figures it out and throws a patch into the next driver release :]

Integer operations should be bit exact on all platforms if memory serves (not sure whether a right shift of a signed value brings in zeros or sign bits, though). Another issue may be bitwise operations that mix 32 and 64 bits: they may leave dirty bits after the conversion (they shouldn’t, unless you are doing something dirty like reinterpreting memory through pointer casts).

They may be sensitive to operation order, though, in cases of overflow and integer division.

Floating-point operations, on the other hand, are sensitive to operation order and are usually not bit exact across platforms.

The optimizer in modern C/C++ is not allowed to reorder mathematical operations unless you enable unsafe optimizations. Make sure that you are not enabling -O3 or fast-math optimizations, as those may let the compiler perform unsafe transformations.

The driver also does optimization on the assembly code, and that may be doing the illegal optimization for some reason.

My rock-stable software produces incorrect results (all zeroes) with CUDA 3.2 and the 260.19 Linux driver. The 190.53 driver works fine for me.

Oops…Here is my rant: http://forums.nvidia.com/index.php?showtopic=186645
