Wrong results when changing from Tesla C870 to Tesla S1070

Dear all,

I have a small piece of code that gives correct results on a Tesla C870 device in a 32-bit Linux machine. When I take the exact same code and compile and run it on a Tesla S1070 device in a 64-bit Linux machine, the results are wrong.

Has anyone already faced this kind of problem?

Thanks in advance for helping me solve this issue.



Check the synchronization in your kernel.

I experienced the same issue when moving from a Tesla C870 to a GTX 280, so the same architecture change as yours. In my case the wrong results also come with random behaviour, so like Mr Nuke I suspect that this comes from a synchronization issue.
However, I don't understand why a sync issue would appear on a compute capability 1.3 device and not on a 1.0 device. As far as I know they have the same warp size, so where does it come from?

I’m having the exact same problem, except that my program works fine on a GeForce 8600 GT and gives wrong results on a GeForce GTX 280 (also with random behaviour). I tried placing __syncthreads after every kernel instruction, but it still didn’t work. Any ideas?

Try a __syncthreads before and after every global load/store, and every shared load/store, see if it clears up the problem.
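To make that concrete, here is a minimal sketch of a barrier placed between a shared store and a shared load of data written by another thread. The kernel name `shift_in_block`, the host helper `shift_on_gpu`, and the block size `TILE` are all hypothetical and are not taken from the posted code. Keep in mind that `__syncthreads()` is a barrier for the threads of one block only, and that it should sit outside divergent branches so that every thread of the block reaches it.

```cuda
#include <cuda_runtime.h>

#define TILE 64  // hypothetical block size chosen for this sketch

// out[i] receives the element one position to the right of in[i] within the
// same TILE-sized chunk (wrapping around inside the chunk).
__global__ void shift_in_block(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    float v = (i < n) ? in[i] : 0.0f;   // global load
    tile[threadIdx.x] = v;              // shared store

    // Barrier outside any branch: every thread of the block reaches it.
    // Without it, the load below may run before the neighbour has stored.
    __syncthreads();

    float neighbour = tile[(threadIdx.x + 1) % TILE];  // shared load

    if (i < n)
        out[i] = neighbour;             // global store
}

// Host helper (hypothetical): runs the kernel on n floats, returns true on success.
bool shift_on_gpu(const float *in, float *out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);
    int blocks = (n + TILE - 1) / TILE;
    shift_in_block<<<blocks, TILE>>>(d_in, d_out, n);

    // The blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

If removing the barrier in a kernel like this changes the results from run to run, the problem is inside a block; if adding barriers everywhere changes nothing, the race is probably somewhere a block-level barrier cannot reach.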

I’ve tried your suggestion but it does not seem to work in my case.

If you have any other idea …

The issue may be deeper than I thought. I can’t guarantee I’ll find a solution, as I don’t have a GT200 to test on, but I’ll gladly have a look at your kernel code if you can post it.


Here it is. Thank you for having a look at it.

The main program is a Fortran code whose aim is to invert a complex double-precision matrix A.

I hope you can find what is wrong with it. The single-precision program should work.
GPU_Invert_Simple.rar (6.24 KB)
GPU_Invert_Double_Prec.zip (6.71 KB)

I found the problem in my kernel. Mr Nuke was right: it was a synchronization problem, but it could not be solved with __syncthreads.
My problem was that different blocks were modifying the same memory region. I hope this helps you find your problems.
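For anyone hitting the same thing: `__syncthreads()` only orders the threads of a single block, so it cannot protect memory that several blocks write. One common way around this, sketched below under my own assumptions (this is not the kernel from this thread), is to give each block its own output slot and combine the slots afterwards. The names `block_sums`, `sum_on_gpu`, and `THREADS` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 128  // hypothetical block size; must be a power of two here

// Each block sums its slice of `in` and writes ONE result to its own slot,
// partial[blockIdx.x]. No two blocks ever write the same address.
__global__ void block_sums(const int *in, int *partial, int n)
{
    __shared__ int s[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction inside the block; the barrier orders this block only.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];
}

// Host helper (hypothetical): stores the sum of `in` in *total, returns true on success.
bool sum_on_gpu(const std::vector<int> &in, long long *total)
{
    *total = 0;
    if (in.empty()) return true;

    int n = (int)in.size();
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = nullptr, *d_partial = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_partial, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    block_sums<<<blocks, THREADS>>>(d_in, d_partial, n);

    // The kernel has finished once this blocking copy returns, so the
    // per-block results can be combined safely on the host.
    std::vector<int> partial(blocks);
    cudaError_t err = cudaMemcpy(partial.data(), d_partial,
                                 blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    if (err != cudaSuccess) return false;

    for (int p : partial) *total += p;
    return true;
}
```

The design choice is that the end of a kernel launch is the only point where all blocks are known to be done, so anything that needs results from several blocks is moved to after the launch (on the host here, or in a second kernel).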

As I said, I don’t have a GT200 to test on, but I did glance over the code. If the size of the matrix is a multiple of 32x32, then the following might not be the problem:

The memory does not seem to be padded to match the warp size. Let’s say that you have a 31x31 matrix (31 rows, 31 columns). Thread [31] of the first warp is meant to handle row[0], column[31], which does not exist in a 31-column row; because the rows are stored back to back, it will instead process row[1], column[0]. My suggestion is to make sure the memory is properly padded or, for a quick test, to see whether matrices whose dimensions are multiples of 32 produce correct results.
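Here is a rough sketch of the index arithmetic I mean. It assumes a row-major float matrix, which is a simplification of your complex double-precision Fortran data, and the names `padded_width`, `row_of`, `col_of`, and `scale_padded` are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Rounds a row length up to the next multiple of `warp` (32 on these devices).
__host__ __device__ inline int padded_width(int cols, int warp = 32)
{
    return ((cols + warp - 1) / warp) * warp;
}

// Row and column that a flat index lands on for a given row stride.
__host__ __device__ inline int row_of(int flat, int stride) { return flat / stride; }
__host__ __device__ inline int col_of(int flat, int stride) { return flat % stride; }

// Scales a rows x cols matrix stored with `stride` elements per row.
// Threads that fall into the padding columns do nothing.
__global__ void scale_padded(float *a, int rows, int cols, int stride, float k)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (row < rows && col < cols)
        a[row * stride + col] *= k;
}
```

With a stride of 31, flat index 31 gives `row_of(31, 31) == 1` and `col_of(31, 31) == 0`, which is the wrong element. With a stride of `padded_width(31) == 32`, the same index stays in row 0 at column 31, a padding slot that the `col < cols` check skips. The runtime also offers `cudaMallocPitch`, which picks a suitable row pitch (in bytes) for you.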

I’ll be looking to see if I can find anything else that’s wrong.