No, because allocation of threads to streaming multiprocessors is at the granularity of blocks.
On a side note, it is not useful or even correct to think of a thread as occupying a core. A "core" in Nvidia terminology corresponds to what is called a floating-point unit (FPU) on CPUs, while the closest CUDA equivalent to a CPU core is a streaming multiprocessor (SM); each thread is uniquely assigned to an SM.
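You can check these numbers for your own GPU at runtime. Here is a minimal sketch (assuming device 0; error checking omitted for brevity):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assuming device 0

    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```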
The GPU I am using has 80 streaming multiprocessors, each with 64 cores. So in the above example, each block gets assigned to a streaming multiprocessor. Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait their turn?
And within each block, since there are 256 threads per block, does the block use the same multiprocessor four times over to complete all of its threads?
Is my understanding correct?
So within this grid-block arrangement, are all the cores used at a given time?
Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait their turn?
There’s no reason to assume that the block scheduler would schedule only the first 80 blocks and then stop; more than one block can be resident on an SM at a time. Blocks will be scheduled until full occupancy is reached, at which point the block scheduler waits until one or more blocks retire before scheduling further blocks.
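You can ask the runtime directly how many blocks of a given size fit on one SM, via the occupancy API. A sketch (the kernel name my_kernel is a placeholder for the one you are launching):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, standing in for the one discussed above.
__global__ void my_kernel(float *data) { /* ... */ }

int main() {
    int blocksPerSM = 0;
    // 256 threads per block, no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, 256, 0);
    printf("Active blocks per SM: %d\n", blocksPerSM);
    // On an 80-SM GPU, the total is blocksPerSM * 80 resident blocks.
    return 0;
}
```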
And within each block, since there are 256 threads per block, does the block use the same multiprocessor four times over to complete all of its threads?
The SM’s 64 CUDA cores will chip away at the work in warps of 32 threads each.
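With 256 threads per block, each block is split into 256 / 32 = 8 warps. A small sketch showing how a thread finds its warp and lane within a 1-D block:

```cpp
#include <cstdio>

__global__ void show_warp_layout() {
    int warpInBlock = threadIdx.x / warpSize;  // 0..7 for a 256-thread block
    int lane        = threadIdx.x % warpSize;  // 0..31 within the warp
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpInBlock, threadIdx.x);
}

// Launch example: show_warp_layout<<<200, 256>>>();
```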
So within this grid-block arrangement, are all the cores used at a given time?
In the CUDA programming model this is all handled by the hardware scheduler.
So with your example of 256 threads per block, and assuming a limit of 2,048 resident threads per SM, you can have 2048 / 256 = 8 active blocks on an SM.
You’re launching 200 blocks, while 8 active blocks per SM * 80 SMs allows 640 active blocks.
This means you would be utilizing only about 31% (200 / 640) of the available block slots. You need to feed the GPU more work.
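A common way to feed the GPU more work is a grid-stride loop, which decouples the launch size from the problem size so you can launch enough blocks to fill the machine. A sketch (the kernel and array are illustrative):

```cpp
__global__ void scale(float *x, int n, float a) {
    // Grid-stride loop: each thread handles multiple elements,
    // so the grid size can be chosen to saturate the GPU.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        x[i] *= a;
    }
}

// Launch example: 640 blocks of 256 threads would fill all block
// slots on the 80-SM GPU discussed above (8 * 80 = 640).
// scale<<<640, 256>>>(d_x, N, 2.0f);
```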
Keep in mind that this assumes you’re not maxing out any other resource. For example, if your kernel requires all of an SM’s available shared memory, only one block per SM will be active at a time.
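You can see that effect with the same occupancy query from the earlier sketch, by requesting a large dynamic shared-memory allocation per block (48 KB here is an assumption; the actual per-SM shared memory limit varies by architecture):

```cpp
int blocksPerSM = 0;
// Same 256-thread my_kernel as above, but now requesting 48 KB of
// dynamic shared memory per block; if that is all the SM has,
// only one block can be resident at a time.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                              256, 48 * 1024);
printf("Active blocks per SM with 48 KB smem: %d\n", blocksPerSM);
```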
One way to get more insight into the efficiency of your kernels is to profile them with Nsight Compute.