Well first of all, that code snippet doesn’t even allocate device memory, which is what I’m saying is the problem here.
Second of all, the runtime API documentation explicitly states that the only “initialization” required is cudaSetDevice(int deviceIndex) and even this is optional, as executing a device command without selecting a device will automatically initialize device 0. This initialization is done per thread.
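To make that concrete, here is a minimal sketch of the kind of standalone test I mean: select the device explicitly, allocate a small buffer, and report whatever error the runtime returns. This is not the code from my project. The `check` helper and the buffer size are made up for illustration, and I'm assuming the card shows up as device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from my project): print the runtime's error
// string and report whether the call succeeded.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    int count = 0;
    if (!check(cudaGetDeviceCount(&count), "cudaGetDeviceCount")) return 1;
    if (count == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    // Explicit selection. Assumption: device 0 is the card being tested.
    // Per the docs this call is optional, since device 0 is picked
    // automatically on the first device command in a thread.
    if (!check(cudaSetDevice(0), "cudaSetDevice")) return 1;

    // Small device allocation; the size is arbitrary.
    float *d_buf = nullptr;
    const size_t bytes = 1024 * sizeof(float);
    if (!check(cudaMalloc((void **)&d_buf, bytes), "cudaMalloc")) return 1;

    std::printf("allocated %lu bytes on device 0\n", (unsigned long)bytes);

    if (!check(cudaFree(d_buf), "cudaFree")) return 1;
    return 0;
}
```

If something this small fails at cudaMalloc in a plain console project but not in one that links SDL (or the other way around), that would at least narrow down where the problem is.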
The supporting evidence for this is that everything worked for a while without any calls to CUDA_INIT_DEVICE() or cudaSetDevice, and then it just stopped (which actually makes me a little nervous: did I permanently damage my 8800 GTX by running a crashing CUDA program?).
The only thing I can think of is that the main difference between the program you’re running and this one is that I’m linking with SDL and SDLmain.lib, which requires setting the project’s code generation to multithreaded DLL or multithreaded debug DLL. I don’t understand why that would have this kind of effect, though, nor how I was able to cudaMalloc objects successfully up until the one time the program crashed. (Before anyone asks: YES, I have restarted my computer a dozen times since then, and I get the same thing.)