Max Concurrent Threads

Hi All,

I am new to CUDA programming (I have been using C++ for a few months) and am using it to solve some mathematical equations.
I do small atomic calculations in the kernel calls.

I have an RTX 3060 Ti GPU. My question is: what is the maximum number of concurrent threads that I can run on this GPU?

I have been searching the internet, and everyone explains blocks and grids, but what are the real numbers for specific hardware?

Is there a small program that shows those numbers?
I found some tools that show cores = 38 x 128 = 4864.
Some others show Grid = 304 x 128.

I would be very thankful for an answer.

Regards

The maximum number of concurrent threads is the maximum threads per SM multiplied by the number of SMs in your GPU. You can find the number of SMs with a Google search (it seems to be 38) or by running the deviceQuery sample code.

I think the maximum threads per SM is also reported by deviceQuery, and it can be found in a table in the programming guide. (Your 3060 Ti is a cc 8.6 GPU, which deviceQuery also reports.)

38x1536 = 58,368 max concurrent threads

So a good “minimum” kernel launch config to aim for might be 114 blocks of 512 threads each (114 x 512 = 58,368).

Thanks @Robert_Crovella for your quick answer.
As I said, I am watching some videos and reading as much as I can to understand more about GPU programming.

Honestly, I had read this 1536 somewhere, but I thought 38 x 1536 = 58,368 was a silly number (I am used to 2, 4, 8, … 256, 512, …), so I assumed that 1536 could be wrong.

Now I have printed the cudaDeviceProp:
GPUEngine Prop:

Property                       Value
multiProcessorCount            38
maxBlocksPerMultiProcessor     16
maxThreadsPerBlock             1024
maxThreadsPerMultiProcessor    1536

This means I can have:
Max blocks: 38 x 16 = 608
Max threads: 38 x 1536 = 58,368 (as you mentioned)

and I can make some combinations like:

456 x 128 (I like this)
228 x 256
114 x 512 (as you mentioned)

Do the combinations make a difference?

Best Regards

It can’t be answered independent of your actual kernel code. For many kernel code designs, such variation will make little difference in performance, in my experience. However it is possible based on a specific kernel design that the number of threads per block is an important choice, perhaps even for correctness.