If nothing happens, then it is quite likely your kernel aborted due to an error (you have to check the return codes from CUDA functions after your kernel call to catch this). Since you don’t show any calls to cudaMalloc()/cudaMemcpy() in your example, the first thing to verify is that the pointers you are passing to your kernel are device pointers, not host pointers. Accessing a host pointer on the device causes an immediate kernel abort.
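To make the device-pointer point concrete, here is a minimal sketch of the usual allocate / copy / launch / copy-back pattern. The kernel `double_elements` and the helpers `ok` and `double_on_device` are made-up names for illustration and are not taken from your code. It also uses cudaDeviceSynchronize(), which assumes a toolkit recent enough to have it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element of an array in device memory.
__global__ void double_elements(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= 2.0f;
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Doubles n host floats in place by round-tripping them through the device.
bool double_on_device(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    // d_data is a device pointer; only this kind of pointer goes to the kernel.
    if (!ok(cudaMalloc((void **)&d_data, bytes), "cudaMalloc"))
        return false;

    bool good = ok(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice),
                   "copy to device");
    if (good) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        double_elements<<<blocks, threads>>>(d_data, n); // d_data, not h_data
        good = ok(cudaGetLastError(), "kernel launch") &&
               ok(cudaDeviceSynchronize(), "kernel execution");
    }
    if (good)
        good = ok(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost),
                  "copy to host");

    cudaFree(d_data);
    return good;
}

int main()
{
    float h_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (!double_on_device(h_data, 4))
        return 1;
    printf("%g %g %g %g\n", h_data[0], h_data[1], h_data[2], h_data[3]);
    return 0;
}
```

If you swap `d_data` for `h_data` in the launch line, the synchronize call is where the failure should show up on real hardware.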
Check the return code from cudaThreadSynchronize() after your kernel call. The synchronization function is not required for correctness, but that is the easiest way to ensure you catch all kernel execution errors during debugging.
No CUDA function prints errors to the screen automatically (as this is terrible behavior for a library), so unless you check return codes, you will never know what is failing. :)
A simple error handling scheme that prints the error and aborts is found in a shared header in the CUDA SDK:
    cudaError err = cudaThreadSynchronize(); // Put whatever call here
    if( cudaSuccess != err) {
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
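One way to avoid retyping that check after every call is to wrap it in a macro. The sketch below is only an illustration: `CUDA_CHECK`, the `fill` kernel and `run_fill` are hypothetical names, not the SDK's own. It uses cudaDeviceSynchronize() in place of cudaThreadSynchronize(), which newer toolkits mark as deprecated.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: same print-and-abort scheme, usable around any call.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (cudaSuccess != err_) {                                          \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",   \
                    __FILE__, __LINE__, cudaGetErrorString(err_));          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Hypothetical kernel: every thread stores the same value.
__global__ void fill(int *d_out, int value)
{
    d_out[threadIdx.x] = value;
}

// Fills 32 ints on the device and returns the first one read back.
int run_fill(int value)
{
    const int n = 32;
    int h_out[n];
    int *d_out = NULL;

    CUDA_CHECK(cudaMalloc((void **)&d_out, n * sizeof(int)));

    fill<<<1, n>>>(d_out, value);
    CUDA_CHECK(cudaGetLastError());      // bad launch configuration, etc.
    CUDA_CHECK(cudaDeviceSynchronize()); // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h_out, d_out, n * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));
    return h_out[0];
}

int main()
{
    printf("first element: %d\n", run_fill(7));
    return 0;
}
```

The kernel launch itself returns nothing, which is why the two checks right after it matter: cudaGetLastError() picks up launch failures, and the synchronize call picks up anything that went wrong during execution.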
Ah, this is interesting. I haven’t used emulation mode in years, but as I recall, it is incredibly permissive. (i.e., it will let you do things that are not allowed on real CUDA devices, like pass host pointers to kernels.) Most of the failure modes I was imagining you having would not even occur in emulation mode…
Seems to work for me. I took out the timing code (it doesn’t compile on Windows), collapsed all the code into one file, hardwired the size of the automatic arrays to “char line[5000]” (nvcc on Windows doesn’t accept non-constant sizes for automatic arrays), hardwired the sizes to 3 by 3 (I also tried 4800 x 512), and removed num_dims. Input was:
Note that you never call cudaFree() on the pointers returned by cudaMalloc(). I ran it on a GeForce 470 and a 9800. With these kinds of problems, it is always good to simplify the code as much as possible (I can hardly read read_matrix_col_major with all the commented-out code) and to always check return codes. You don’t do any checking anywhere, which is a bad habit that creeps into real code. Unfortunately, most code I’ve seen posted doesn’t check return codes at all. It is also good to check your build to make sure you are compiling and linking it as you expect. If you still cannot get anything to work, try going back to a “helloworld.cu” example (one kernel call with one parameter pointer assignment, one block in grid, one thread in block) and verify you have everything installed right.
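For reference, a “helloworld.cu” along those lines could look roughly like the sketch below. It is only an illustration, with made-up names (`hello`, `hello_device`), and it assumes a toolkit that provides cudaDeviceSynchronize(). It does one kernel call with one pointer parameter, one block, one thread, checks every return code, and frees what it allocates.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread writes one value through the single pointer parameter.
__global__ void hello(int *d_result)
{
    *d_result = 42;
}

// Returns the value written by the kernel, or -1 if any CUDA call fails.
int hello_device()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    if (count == 0) {
        fprintf(stderr, "No CUDA devices found\n");
        return -1;
    }

    int *d_result = NULL;
    err = cudaMalloc((void **)&d_result, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return -1;
    }

    hello<<<1, 1>>>(d_result);          // one block in grid, one thread in block
    err = cudaGetLastError();           // did the launch itself fail?
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // did the kernel fail while running?

    int h_result = -1;
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_result, d_result, sizeof(int),
                         cudaMemcpyDeviceToHost);

    cudaFree(d_result);                 // release the cudaMalloc pointer

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return h_result;
}

int main()
{
    int value = hello_device();
    printf("kernel wrote %d\n", value);
    return value == 42 ? 0 : 1;
}
```

If this fails on a machine where the larger program also fails, the problem is more likely the install or the build than your kernel.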