CUDA Cores - running threads, blocks, kernels etc.

Hello there,
I know that a block can run 512 threads, and that you can have 65,535 blocks. I also know that threads within a block can communicate with each other, which is very useful. But my main point is: if you have 65,535 blocks running with 1 thread each on a GPU with 300 CUDA cores, will 300 blocks run at once, i.e. 300 simultaneous actions? Or do you have to have 1 block with 512 threads to take advantage of the 300 cores? I know there's something about a warp and 32 threads running at once… Could someone sum it all up for me? Would 10 blocks each running 32 threads be optimal?

Stuart P.


The CUDA cores you are referring to are actually just vector engines. They execute in groups of 32 (the warp size) over the threads in a given block. There are limits to the number of blocks you can put onto a given SM. A GPU consists of several SMs, each of which contains N CUDA cores.

If you tried to schedule 300 blocks of 1 thread each you'd get very bad performance, because the lowest scheduling unit is a warp of 32 threads. You need LOTS of threads on each SM, not lots of blocks. Typically you see only between six and eight blocks resident on an SM at any one time.

A good rule of thumb is to aim for either 192 or 256 threads per block and you'll be fine, as both of these values produce pretty good loading of the hardware.

Thanks for that, that rule of thumb is probably the easiest part to understand.

Am I right in thinking:

  • SM stands for Streaming multiprocessor.

  • GTX 480 has 480 SMs

  • 6 to 8 blocks are on a given SM

  • 192 threads minimum per block means minimum 1152 threads per SM

  • 1152 threads per SM * 480 SMs means for insane computation I want about 552,960 threads scheduled across 2,880 blocks

  • An SM only handles N blocks at a time, whilst the others are loading in order to hide memory loading latencies, hence 1 thread per block would result in each SM running just N threads instead of 32*N?


The GTX 480 has 15 SMs, each of which has 32 scalar cores that execute like a wide SIMD or vector machine.

Ah, OK, so it's more like 17,280 threads across 90 blocks, for 6 blocks per SM and 192 threads per block.


Those numbers don’t look bad, but the reasoning behind them mixes some unrelated numbers.

On compute capability 1.x devices you should have at least 192 threads or 6 warps active per SM (not per block). On a 2.0 device, you should have at least 576 active threads per SM.