Large Kernel and Emu vs. Device Differences

I have ported a simulation to CUDA v1.1 on XP 32-bit with an 8800GTX as one large chunk. This has resulted in what I assume would be considered a large kernel.

lmem = 84 smem = 48 reg = 99 bar = 0

Despite the apparently low occupancy, the performance is tantalizing. However, the values I get back from the CUDA kernel are not quite correct, while running in EmuRelease or EmuDebug mode yields the correct results. I have converted all the constants, math functions, etc. to single-precision versions and have added _controlfp(_PC_24, _MCW_PC); but the problem persists. I'm checking whether each of the cuda*(…) calls succeeds, and they all report success.

I know this is not a lot of information, but I'm not sure posting the code would help much. Does the kernel size seem insane? Could it be related to the problem? Any suggestions for tracking down the source of the problem? My current plan is to start disabling sections of the code and comparing the Emu and non-Emu results.


I do the same thing when converting code: step by step, comparing the results after each step. So disabling the last part of your code and comparing intermediate results seems to me the best option you have.

OK, I made some macros to help store the state of every variable, reduced the execution to a single thread, and found that the problem appears to be the following.

There is a loop in the kernel that looks like:

for (int h = 1; h < 25; h++) {


the body of which does not depend on h and cannot terminate early. Within the body there are several accumulators. The number of loop trips matters, as does their order.

The emulation-mode version works fine. The device version loops only 19 times.

Adding ‘#pragma unroll 1’ before the loop has no effect. Adding volatile, as in

for (volatile int h = 1; h < 25; h++)

causes the device version to loop the correct number of times.

Making the loop body dependent on h, as happens when I instrument it to examine state, also makes the device version run the correct number of times.

The PTX does not change much in length. The difference between the version with volatile and the version without seems to be mainly a few moves.

I am at a loss as to why this might be happening. Any ideas?


I'm now working backwards, stripping out all the debug crud, and found that removing the volatiles from many of the float variables also introduces large numeric error. I don't mind leaving the volatiles in place, but I would feel a lot more comfortable knowing why they are needed.

Adding -Xptxas -O0 to the custom build options also seems to “fix” these problems; -Xptxas -O1 brings them back.

I hope someone from NVIDIA reads this…

After adding -Xptxas -O0 AND altering the kernel to index the parameter space by a thread id computed as thread_id = blockIdx.x*64 + threadIdx.x, I see the following error:

1>Internal error
1>nvcc error : ‘ptxas’ died with status 0xC0000005 (ACCESS_VIOLATION)

The internal error is not seen with -Xptxas -O1.

Please post code that reproduces the bug and we will file it. Better yet, become a registered developer and you can file bugs directly into our system!