Large Kernel and Emu vs. Device Differences

I have ported a simulation to CUDA v1.1 on XP 32-bit with an 8800GTX as one large chunk. This has resulted in what I assume would be considered a large kernel.

lmem = 84 smem = 48 reg = 99 bar = 0

Despite the apparently low occupancy, the performance is tantalizing. However, the values I get back from the CUDA kernel are not quite correct, while running in EmuRelease or EmuDebug mode yields the correct results. I have converted all the constants, math functions, etc. to single-precision versions and have added _controlfp(_PC_24, _MCW_PC); but the problem persists. I'm checking whether each of the cuda*(…) calls succeeds, and they all report success.

I know this is not a lot of information, but I'm not sure posting the code would help much. Does the kernel size seem insane? Could it be related to the problem? Any suggestions for tracking down the source of the problem? My current plan is to start disabling sections of the code and comparing the Emu and non-Emu results.


I do the same thing when converting code: step by step, comparing the results after each step. So disabling the last part of your code and comparing intermediate results seems to me the best option you have.

OK, I made some macros to help store the state of every variable, reduced the execution to a single thread, and found that the problem appears to be the following.

There is a loop in the kernel that looks like:

for (int h = 1; h < 25; h++) {


the body of which does not depend on h and cannot terminate early. Within the body there are several accumulators. The number of loop trips matters, as does their order.

The emulation-mode version works fine. The device version loops only 19 times.

Adding ‘#pragma unroll 1’ before the loop has no effect. Adding volatile, as in

for (volatile int h = 1; h < 25; h++)

causes the device version to loop the correct number of times.

Making the loop body dependent on h, as happens when I instrument it to examine state, also makes the device version run the correct number of times.

The PTX does not change much in length. The difference between the version with volatile and the version without seems to be mainly a few moves.

I am at a loss as to why this might be happening. Any ideas?


I'm now working backwards, stripping out all the debug crud, and found that removing the volatiles from many of the float variables also introduces large numeric error. I don't mind leaving the volatiles in place, but I would feel a lot more comfortable knowing why they are needed.

Adding -Xptxas -O0 to the custom build options also seems to “fix” these problems; -Xptxas -O1 brings them back.

I hope someone from NVIDIA reads this…

After adding -Xptxas -O0 AND altering the kernel to index the parameter space by a thread id computed as thread_id = blockIdx.x*64 + threadIdx.x, I see the following error:

1>Internal error
1>nvcc error : ‘ptxas’ died with status 0xC0000005 (ACCESS_VIOLATION)

The internal error is not seen with -Xptxas -O1.

Please post code that reproduces the bug and we will file it. Better yet, become a registered developer and you can file bugs directly into our system!