Get the number of cores used in a kernel launch

Hi,

I am using a Volta V100 GPU. It has 5120 cores divided among 80 streaming multiprocessors. I am launching a kernel as follows:

dim3 block(16, 16, 1);
dim3 grid(200);
kernel<<<grid, block>>>();

So here the total number of threads launched is 200 × 16 × 16 = 51200.

Does this mean that each thread occupies one core, and that we are using the entire set of cores 10 times (5120 × 10 = 51200) to finish executing the kernel?

No, because allocation of threads to streaming multiprocessors is at the granularity of blocks.

On a side note, it is not useful or even correct to think of a thread as occupying a core. A "core" in NVIDIA terminology is what would be called a floating-point unit (FPU) on a CPU, while the closest CUDA equivalent to a CPU core is a streaming multiprocessor (SM); each thread is uniquely assigned to an SM.

The GPU I am using has 80 streaming multiprocessors, each with 64 cores. So in the above example, each block gets assigned to a streaming multiprocessor. Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait for their turn?
And within each block, since there are 256 threads, does the block reuse the same multiprocessor 4 times to complete?
Is my understanding correct?
So within this grid-block arrangement, are all the cores in use at a given time?

Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait for their turn?

There’s really no reason to assume that the block scheduler would schedule the first 80 blocks and then wait. All blocks will be scheduled until full occupancy is reached, at which point the block scheduler waits until one or more blocks retire before scheduling further blocks.

And within each block, since there are 256 threads, does the block reuse the same multiprocessor 4 times to complete?

The 64 CUDA cores will chip away at the work in warps of 32 threads each.

So within this grid-block arrangement, are all the cores in use at a given time?

In the CUDA programming model this is all handled by the hardware scheduler.

@sajshaj94

After some sleep I thought of a different way to explain your question 1.

If you look at Table 15, Technical Specifications per Compute Capability, in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, the maximum number of resident threads per multiprocessor for a V100 (CC 7.0) is 2048. This is much larger than the 64 CUDA cores on the SM.

So with your example of 256 threads per block, you can have 8 active blocks on an SM (2048 / 256).

You’re launching 200 blocks, while 8 active blocks per SM × 80 SMs allows 640 active blocks.

This means you would be utilizing only about 31% (200 / 640) of the available block slots. You need to feed the GPU more work.

Keep in mind that this assumes you’re not maxing out any other resource. For example, if your kernel requires all available shared memory on an SM, only one block will be active at a time.

One way to get more insight to the efficiency of your kernels is to profile them with Nsight Compute.

https://developer.nvidia.com/nsight-compute-2019_5

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9345-cuda+kernel+profiling+using+nvidia+nsight+compute