I have ported a simulation to CUDA v1.1 on XP 32-bit with an 8800GTX, as one large chunk. This has resulted in what I assume would be considered a large kernel.
lmem = 84 smem = 48 reg = 99 bar = 0
Despite the apparently low occupancy, the performance is tantalizing. However, the values I get back from the CUDA kernel are not quite correct, whereas running in EmuRelease or EmuDebug mode yields the correct results. I have converted all the constants, math functions, etc. to their single-precision versions and have added _controlfp(_PC_24, _MCW_PC); but the problem persists. I’m checking whether each of the cuda*(…) functions succeeds, and they all seem fine.
I know this is not a lot of information, but I’m not sure posting the code would help much. Does the kernel size seem insane? Could it be related to the problem? Any suggestions for tracking down the source of this problem? I currently plan to just start disabling sections of the code and comparing the Emu and non-Emu results.