Get the number of cores used in a kernel launch

Hi,

I am using a Volta V100 GPU. It has 5120 cores divided among 80 streaming multiprocessors. I am launching a kernel as follows:

dim3 block(16, 16, 1);
dim3 grid(200);
kernel<<<grid, block>>>();

So here the total number of threads launched is 200 × 16 × 16 = 51200.

Does this mean that each thread occupies one core, and that we are using the entire set of cores 10 times (5120 × 10 = 51200) to finish executing the kernel?

No, because allocation of threads to streaming multiprocessors is at the granularity of blocks.

On a side note, it is not useful or even correct to think of a thread as occupying a core. A "core" in NVIDIA terminology is what would be called a floating-point unit (FPU) on a CPU, while the closest CUDA equivalent to a CPU core is a streaming multiprocessor (SM); each thread is uniquely assigned to an SM.

The GPU I am using has 80 streaming multiprocessors, each with 64 cores. So in the above example, each block gets assigned to a streaming multiprocessor. Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait for their turn?
And within each block, since there are 256 threads, does the block reuse the same multiprocessor 4 times to complete?
Is my understanding correct?
So within this grid-block arrangement, are all the cores in use at a given time?

Since there are only 80 streaming multiprocessors, do the first 80 blocks the scheduler chooses to execute run while the remaining 120 blocks wait for their turn?

There’s really no reason to assume that the block scheduler would schedule the first 80 blocks and then wait. All blocks will be scheduled until full occupancy is reached, at which point the block scheduler waits until one or more blocks retire before scheduling further blocks.

And within each block, since there are 256 threads, does the block reuse the same multiprocessor 4 times to complete?

The 64 CUDA cores will chip away at the work in warps of 32 threads each.

So within this grid-block arrangement, are all the cores in use at a given time?

In the CUDA programming model this is all handled by the hardware scheduler.

@sajshaj94

After some sleep I thought of a different way to explain your question 1.

If you look at Table 15, Technical Specifications per Compute Capability, in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, the maximum number of resident threads per multiprocessor for a V100 (CC 7.0) is 2048. This is much larger than the 64 CUDA cores on the SM.

So with your example of 256 threads per block, you can have 8 active blocks on an SM (2048 / 256).

You’re launching 200 blocks, while 8 active blocks per SM × 80 SMs allows 640 active blocks.

This means you would be utilizing only about 31% (200 / 640) of the available block slots. You need to feed the GPU more work.

Keep in mind that this assumes you’re not maxing out any other resource. For example, if your kernel requires all available shared memory on an SM, only one block will be active at a time.

One way to get more insight to the efficiency of your kernels is to profile them with Nsight Compute.

https://developer.nvidia.com/nsight-compute-2019_5

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/

https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9345-cuda+kernel+profiling+using+nvidia+nsight+compute