Yes, it might make sense. A single block has a maximum of 1024 threads. As you have shown, the max capacity for that SM is 2048 threads; this might also be called "maximum occupancy" in this case. Therefore you should launch at least 2 blocks of 1024 threads, or more blocks if your threads-per-block count is smaller.
Whether or not the difference between 1024 resident threads and 2048 resident threads makes a performance difference will depend on your actual code, but in many cases it will, since the number of resident threads is a fundamental parameter that determines a GPU's ability to hide latency.
Ah, you’re right – thanks txbob. The max number of threads per block is 1024. (I’ve just queried that from cudaDeviceProp).
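For reference, the relevant limits can all be read back at runtime from `cudaDeviceProp` — a minimal sketch (requires the CUDA toolkit; queries device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    return 0;
}
```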
When compiling for a much beefier GPU, should my block-creation strategy be to create at least as many blocks as there are SMs, in the hope that the blocks will be distributed among all SMs? And, in general, do smaller blocks amount to better sharing of the workload between SMs? For example, if I had 500 threads and 4 SMs but created blocks of size 100, there is a chance that one of the SMs would get two blocks while the rest get one each. Or is it futile to think about these things, considering that none of the details of how blocks are assigned to SMs have been made public?
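One portable way to sidestep guessing at the (undocumented) block scheduler is to size the grid from the device's own SM count and per-SM thread capacity, so every SM has enough resident blocks regardless of how they are distributed — a sketch (the kernel and its argument are placeholders, and the multiplier is a tuning knob, not a published rule):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(int n) { /* placeholder kernel body */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 256;
    // Enough blocks to saturate every SM's thread capacity,
    // e.g. 2048 / 256 = 8 resident blocks per SM on the device discussed above.
    int blocks = prop.multiProcessorCount *
                 (prop.maxThreadsPerMultiProcessor / threadsPerBlock);

    kernel<<<blocks, threadsPerBlock>>>(500);
    cudaDeviceSynchronize();
    return 0;
}
```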