Program works on GTX 260 but not on GTX 480

It isn’t a compiler issue, and warps are still warps on Fermi. See section 1.2.2 of the CUDA 3.1 “Fermi Compatibility Guide” for details. The short version: Fermi has no instructions that operate directly on shared memory; it loads into registers and stores back. So unless you do something explicit (such as declaring the shared array volatile), compiler optimization can leave values that should be in shared memory sitting in registers, and that breaks the implicit intra-warp synchronization a lot of code relies on for things like reductions.
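To make that concrete, here is a minimal sketch (not code from this thread) of the final-warp stage of a classic shared-memory reduction, in the style of the CUDA SDK reduction sample. The `volatile` qualifier is what forces every access to `sdata` to be an actual shared-memory load/store on sm_20 instead of a cached register value:

```cuda
// Sketch: final-warp reduction without __syncthreads().
// Threads within a warp execute in lockstep, but on Fermi the
// compiler may otherwise keep sdata[tid] in a register, so the
// other lanes would never see the updated value. `volatile`
// forces a real shared-memory load and store on each access.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```

Without `volatile`, the same code works on a GTX 260 (compute 1.x operates on shared memory directly) but silently returns wrong results on a GTX 480.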

I don’t get how it helps in the globs47 example. How could __syncthreads() help here? Does the compiler understand this instruction?

I am pretty sure __syncthreads() will force a “flush” of the intermediate results of shared memory operations back from registers. I am guessing this happens after PTX generation, when ptxas compiles for the sm_20 target, and it would be perfectly feasible for the code analyzer to see and recognize the PTX syncthreads instruction.

I think this should happen at the compile stage of the C code. How would ptxas know which registers to flush? And the registers may be spoiled: imagine shared memory being used long before the __syncthreads(). It would waste a lot of registers.
… OK, it may somehow analyse the PTX code and write registers back to shared memory if it sees a syncthreads afterwards.

Shared memory transactions exist in their own state space in PTX; it is only at the assembler stage that they are expanded into sm_20 instructions, which include the load and store to registers. I don’t see how it could possibly be done any earlier in the compilation pipeline than that.
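Roughly what that looks like at the PTX level (this is a hand-written illustration, not actual compiler output): a shared-memory read-modify-write appears as explicit `ld.shared` / `st.shared` instructions, which ptxas then expands into register-based sm_20 machine code.

```
// Illustrative PTX for something like: sdata[tid] += sdata[tid + 16];
// (hypothetical registers and offsets, float element size 4)
ld.shared.f32  %f1, [%rd1];       // load sdata[tid]
ld.shared.f32  %f2, [%rd1+64];    // load sdata[tid + 16]
add.f32        %f1, %f1, %f2;
st.shared.f32  [%rd1], %f1;       // store result back to shared memory
```

The point is that at the PTX level the shared-memory accesses are still explicit; it is the backend for sm_20 that introduces the register staging the earlier posts are talking about.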

Thanks, all! I did not realize that the compiler could optimize the code in the wrong way! So I added volatile and removed the synchronization (for it < 32), and it works fine!

Yves

Good, good! Thank you, Avidday.

Hi again. I finally found a workaround, but it is weird; I think I’ve found a bug. If I insert a printf statement (even one printing an empty string) inside one specific kernel, the program runs fine. If I remove the printf, it shows incorrect results or even freezes. This time I used the latest CUDA 3.2 final release, a GTX 460 card (compiling with “-arch=sm_21”), and WinXP 32-bit. I can’t post the code, but I can send it to someone at NVIDIA to fix the problem. Be aware that it is rather complex code, comparing CPU and GPU solutions, and all the comments are in Spanish, but I can provide a readme file, a small test mesh that fails in the second iteration of the simulation, a bigger mesh that freezes, and I think the workaround will also help you debug.