Wrong results with -arch=sm_20 on a compute capability 2.0 GPU (-arch=sm_13 works, -arch=sm_20 does not)

Hello!

I have a problem with a CUDA program.
When I compile it with -arch=sm_13 and run it on a compute capability 2.0 GPU, it works fine and gives me the expected result (which I know to be correct).
But when I compile it with -arch=sm_20 and run it on the same GPU, it runs but gives me wrong results. I have not changed the program between the two compilations.
Where could the problem come from? What differences between these two architectures could cause problems?

Thanks.

What is the difference between a “right result” and a “wrong result”, exactly?

I use the GPU to do some arithmetic operations. With -arch=sm_13 the result given by my program is right (I verified it with Sage). But with -arch=sm_20 the program outputs a result which is completely different from the first one, so it is wrong.

For info, here are the last lines of deviceQuery:

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 4, Device = Tesla S2050, Device = Tesla S2050

You are not making it all that easy to suggest what might be going wrong.

Here are two completely random guesses:

1. When you build for sm_20, the default single precision floating point rounding behaviour is different from that of earlier cards (for example, division and square root are IEEE-rounded by default, and multiply-add sequences can be contracted into fused multiply-adds).

2. Fermi has no instructions that operate directly on shared memory operands (it is a load/store architecture), and that can lead to compiler optimisation breaking code which relies on implicit synchronisation between threads within the same warp. The classic example is a shared memory reduction (see the sketch below).

Neither should apply to sm_13 code which is JIT-translated to Fermi. Could either of these possibly have contributed to the behaviour you are seeing?
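
To illustrate the second point, here is a sketch of the kind of warp-synchronous reduction I mean (a hypothetical kernel, not taken from your program, assuming 256 threads per block):

__global__ void reduce(const float *in, float *out)
{
    // Each block sums 256 values into out[blockIdx.x].
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction with explicit barriers down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Warp-synchronous tail: no __syncthreads(), relying on the 32 threads
    // of a warp executing in lockstep and re-reading shared memory each time.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

On sm_13 the tail happens to work because the hardware reads the shared memory operands on every instruction. Built for sm_20, the compiler is free to keep sdata[tid] in a register across those additions, so the other threads never see the partial sums; declaring the shared array volatile (or adding explicit synchronisation) is the usual fix.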

The second one looks like a good candidate. If I understand you correctly, the following code won't work as expected on a Fermi:

__shared__ int val;   // val is in shared memory

val = 1;
if (threadIdx.x == 0)
    val = 42;
return val;

If I want it to run correctly (return 42 for all threads within a warp), I should add a __syncthreads(). But this code works on a compute capability 1.3 GPU. Did I understand you correctly?
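
To make sure I understand, this is what I would write instead (just a sketch; the function name is made up, and I assume a single warp per block as in my example):

__device__ int get_val()
{
    __shared__ int val;    // val is in shared memory, as before

    val = 1;
    if (threadIdx.x == 0)
        val = 42;
    __syncthreads();       // make thread 0's write visible before anyone reads val
    return val;            // expect 42 in every thread
}

(With more than one warp per block, another __syncthreads() after val = 1 would also be needed, so that the initial writes cannot overwrite thread 0's 42.)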

After further investigation, it seems that the problem might come from the memory copies between host and device. Is that possible? Is there a difference between 1.3 and 2.0 at this level?
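
In case it matters, the copies look roughly like this, with every runtime call checked (a sketch with hypothetical names, not my actual code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main()
{
    const size_t n = 1024;
    float *h_in  = (float *)malloc(n * sizeof(float));
    float *h_out = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_buf = 0;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice));
    // ... kernel launch would go here ...
    CUDA_CHECK(cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_buf));

    free(h_in);
    free(h_out);
    return 0;
}

If one of the copies were failing, the error string should show it.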

Thanks.

Hi avidday, can you please explain this point further? What about STS and LDS?

The code you wrote looks redundant to me. From what I can see, you could just return 42, without the conditional.

It might help if you could present us with a short, self-contained piece of code that reproduces the problem for you.