CUDA 3.2 results different from 3.1

Hello,

I’m running a rather involved MC simulation using CUDA, and upon updating from 3.1 to 3.2, I noticed a slight but palpable difference in the results. I’m simply recompiling and rerunning the same code under the two CUDA versions (3.1 and 3.2). Are there any known computational differences or enhancements in 3.2 that affect the machine arithmetic?

Any help is appreciated. Thanks much, Joe

Just confirmed: CUDA 3.1 results are replicated on the CPU, while, in some instances, CUDA 3.2 results deviate from both the CPU and 3.1. Therefore, I claim CUDA 3.2 is buggy! I suspect the error comes from one of the machine float math functions. I will need more time to trace it to its origin.

Hard to imagine my case is special. Has anyone else encountered a similar problem?

Joe Fatmama,

Could be that the Toolkit has changed a bit… probably it’s using some new instructions…

You may need to read the release notes to see if you need to add some compiler options to disable certain optimizations.

Compare the PTX to know the difference.

What driver are you using ?

I found some code lost functionality going from driver 256.35 to 260.19.21, where the CUDA toolkit remained at 3.1 in both cases (going to 3.2 did not help the issue on 260.19.21).

Using the 260.99 driver at the moment. Haven’t had the time to see what the problem is. I did observe similarly odd behavior going from an older driver to a 260.xx or 259.xx; if I recall correctly, it was from 257.xx to 260.xx. It appears that my RNG logic involving int64 operations is somehow processed differently. That is, certain seeds are consistently affected by the change in CUDA version/driver. I can currently replicate the 3.1/260.99 output on a CPU. It could be that some conversions from long to uint64, or some bit operations, are now handled differently, or that this issue was cleaned up in past driver updates but has now resurfaced in 3.2. Testing this out is also a headache, because with a single machine I am forced to keep switching between 3.1 and 3.2 each time I want to compare output. So, I’m secretly hoping nvidia figures it out and throws a patch into the next driver release :]

Integer operations should be bit exact on all platforms if memory serves (not sure whether a right shift of a signed value brings in zeros or sign bits, though). Another issue may be bitwise operations that mix 32 and 64 bits: they may leave dirty bits after the conversion (they shouldn’t, unless you are doing something dirty like reinterpreting memory through pointer casts).

They may be sensitive to operation order, though, in cases of overflow and integer division.

Floating-point operations, on the other hand, are sensitive to operation order and are usually not bit exact across platforms.

The optimizer in modern C/C++ is not allowed to reorder mathematical operations unless you enable unsafe optimizations. Make sure that you are not enabling -O3 or fast-math optimizations, as those may let the compiler perform unsafe transformations.

The driver also does optimization on the assembly code, and that may be doing the illegal optimization for some reason.

My rock-stable software produces incorrect results (all zeroes) with CUDA 3.2 and the 260.19 Linux driver. The 190.53 driver works fine for me.

Oops…Here is my rant: http://forums.nvidia.com/index.php?showtopic=186645
