I have been developing my first CUDA program. The emulation builds work correctly, but when I run release builds, my kernel never seems to execute (my result arrays are all 0). I am writing my code on a 32-bit XP machine without a CUDA-capable card and testing it on a 64-bit Vista machine with a 9600 GT and a 32-bit XP machine with an 8400M GT. What could be going wrong?
Check the CUDA error status right after your kernel launch and run it in debug mode. See if the kernel actually runs. If you get an “unspecified launch failure”, you're most likely writing outside the bounds of an array.
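A minimal sketch of that kind of check, in case it helps. `checkCuda` and `myKernel` are made-up names standing in for your own code, and the sizes are arbitrary. It assumes a toolkit that has `cudaDeviceSynchronize`; older toolkits used `cudaThreadSynchronize` for the same purpose.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error string and exit on failure.
static void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Placeholder kernel standing in for yours. The bounds guard keeps
// threads past the end of the array from writing anything.
__global__ void myKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 1000;  // arbitrary size for illustration
    int *d_out = NULL;
    checkCuda(cudaMalloc((void **)&d_out, n * sizeof(int)), "cudaMalloc");

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    myKernel<<<blocks, threads>>>(d_out, n);

    // Reports problems with the launch itself (bad configuration, no device).
    checkCuda(cudaGetLastError(), "kernel launch");
    // Waits for the kernel to finish and reports errors raised while it ran,
    // such as an unspecified launch failure from an out-of-bounds write.
    checkCuda(cudaDeviceSynchronize(), "kernel execution");

    int h_out[n];
    checkCuda(cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost),
              "cudaMemcpy");
    printf("out[%d] = %d\n", n - 1, h_out[n - 1]);

    checkCuda(cudaFree(d_out), "cudaFree");
    return 0;
}
```

Kernel launches are asynchronous and do not return an error code themselves, so without the two checks after the launch a failed kernel can go unnoticed and the result arrays simply keep whatever they held before.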