Difference between multiprocessor, CUDA core and kernel function

Hi, I am new to CUDA programming. I have read the Guide, but I am still not clear about how multiprocessors, CUDA cores and kernel functions relate to each other.

When the program runs a kernel function, is it running on one multiprocessor, or one CUDA core?

My GPU is a Quadro FX 380. When I run the “deviceQuery” sample, it reports:

“(2) Multiprocessors x (8) CUDA Cores/MP: 16 CUDA Cores”

When the program runs a kernel function, will it automatically use all 16 cores?

Thanks for any replies.

A kernel runs on all available multiprocessors, and therefore on all of their cores, in parallel, provided the launch contains enough blocks to go around. You only need to make sure that each thread works on its own part of the data, which it selects from its thread and block indices.
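To make the point about thread and block indices concrete, here is a minimal sketch. The kernel name `scale`, the helper `blocksFor`, the array size and the block size of 64 are all made up for illustration; none of them come from the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread scales exactly one array element.
// The element is chosen from the block index and the thread index.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the last block may have spare threads
        data[i] *= factor;
}

// Hypothetical host helper: blocks needed to cover n elements, rounded up.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1000;             // illustrative problem size
    const int threadsPerBlock = 64; // illustrative block size
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // The hardware distributes these blocks over the multiprocessors itself.
    scale<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dev, n, 2.0f);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // This copy also waits for the kernel to finish.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %f, host[%d] = %f\n", host[0], n - 1, host[n - 1]);
    return 0;
}
```

Note that nothing in the code names a multiprocessor or a core. You only choose the number of blocks and threads, and the scheduler places the blocks on whatever multiprocessors the device has.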

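Specifically, in your case, with 2 SMs and thus 16 CUDA cores, you have to launch at a minimum 64 threads to fill the units (one 32-thread warp per SM, so spread over at least two blocks, since a block runs on a single SM). If possible, launch at least 384 threads (192 per SM) to hide the main latencies, although you will still be stalled by global memory reads.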
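The same arithmetic can be done for whatever device is installed. The sketch below assumes that the 64 above means one warp per multiprocessor and that the 384 means 192 threads per multiprocessor; that reading, the constant `kLatencyThreadsPerSm` and both helper names are my own assumptions, and the 192 figure should not be taken as valid for newer architectures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: 192 threads per multiprocessor, i.e. 384 / 2 from the reply above.
const int kLatencyThreadsPerSm = 192;

// Hypothetical helper: one full warp on every multiprocessor.
int minThreadsToOccupy(int smCount, int warpSize)
{
    return smCount * warpSize;
}

// Hypothetical helper: enough threads per multiprocessor to cover latencies.
int threadsToHideLatency(int smCount, int threadsPerSm)
{
    return smCount * threadsPerSm;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0); // device 0
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s: %d multiprocessors, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    printf("threads to occupy every multiprocessor: %d\n",
           minThreadsToOccupy(prop.multiProcessorCount, prop.warpSize));
    printf("threads suggested to hide latency:      %d\n",
           threadsToHideLatency(prop.multiProcessorCount, kLatencyThreadsPerSm));
    return 0;
}
```

These totals only help if the threads are split into at least as many blocks as there are multiprocessors, because all threads of one block stay on the same multiprocessor.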