Question about device query information, 48 cuda cores * 32, versus 1024 limitation. Maximum perform

Skybuck · June 16, 2011, 11:11pm

Question about device query information:

( 1) Multiprocessors x (48) CUDA Cores/MP: 48 CUDA Cores
GPU Clock Speed: 1.62 GHz
Memory Clock rate: 600.00 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64

What do the following numbers mean:

Maximum number of threads per block? Does this mean the BlockDim3 (xyz) is limited to 1024? Or does it mean there can be 1024 active threads executing at the same time (so the BlockDim3 can be much larger, except some of the threads will not execute yet, i.e. they are virtual)? (Perhaps the GPU can only store 1024 threads (thread execution data: instruction pointers/instruction registers) inside its memory, and thus it’s limited?!?) But still… can’t it then simply swap in new threads from the BlockDim3 when it’s done with the active threads?!? (Is this perhaps an implementation for the future?)

It says CUDA cores: 48. Each CUDA core can execute a warp of 32, so that is 48*32 = 1536 threads in parallel. Why is there a limit of 1024?? Is 50 percent being wasted?

If it’s being wasted on a single block, can the 50% wasted execute another block ?

(Maybe some CUDA cores were disabled in hardware, or they are brain dead (production flaws)?)

(Maybe driver/software limitations?)

(Maybe a hardware design mistake?)

(Maybe 50% is used for “book-keeping” = helping other CUDA cores run?!?)

^ Wild crazy hypothesis :)

seibert · June 16, 2011, 11:18pm

There is no thread-swapping on existing CUDA devices. Once a block is assigned to a multiprocessor, all of its threads are active and take ownership of resources (registers, shared memory, etc) until the block terminates. The block cannot be suspended, only terminated. If you request more blocks than can run simultaneously, only some of the blocks will be scheduled for execution. As blocks run to completion (or sometimes slightly less frequently than that), new blocks are started to fill the available slot until all blocks are finished.
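The dispatch behaviour described above can be sketched as a small simulation. This is only an illustrative model, not how the hardware scheduler is actually implemented: the function name `simulate_block_schedule`, the `slots` parameter (how many blocks can be resident at once) and the per-block durations are all hypothetical, and the model starts a new block the instant a slot frees up, which ignores the "slightly less frequently" caveat.

```python
import heapq


def simulate_block_schedule(durations, slots):
    """Return a (start, end) time pair for each block.

    Hypothetical model: at most `slots` blocks are resident at once,
    blocks are dispatched in index order, and a resident block keeps
    its slot until it finishes (it is never suspended or swapped out).
    """
    running = []   # min-heap holding the end times of resident blocks
    schedule = []
    for duration in durations:
        if len(running) < slots:
            start = 0                       # a slot is still free at launch
        else:
            start = heapq.heappop(running)  # wait for the earliest block to end
        end = start + duration
        heapq.heappush(running, end)
        schedule.append((start, end))
    return schedule


if __name__ == "__main__":
    # Four blocks with made-up run times, two resident slots.
    for i, (start, end) in enumerate(simulate_block_schedule([3, 1, 2, 1], 2)):
        print(f"block {i}: starts at {start}, ends at {end}")
```

In this toy run, blocks 0 and 1 start immediately, block 2 only starts once block 1 has terminated, and block 3 waits for the next free slot.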

See my other response to one of your posts about how entire warps are not assigned to a single CUDA core. A single multiprocessor can execute more than one block at once, if sufficient register and shared memory resources are available.
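To make the "more than one block at once" point concrete, here is a rough sketch of the arithmetic. The register and shared memory totals come from the device query output above (on this class of device they are, as far as I know, also the per-multiprocessor totals); the limits of 1536 resident threads and 8 resident blocks per multiprocessor are assumptions based on what the Programming Guide lists for this generation, so check them for your device. The function name is hypothetical, and the model ignores allocation granularity.

```python
def resident_blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                           max_threads_per_sm=1536,  # assumption, not in the post
                           max_blocks_per_sm=8,      # assumption, not in the post
                           regs_per_sm=32768,        # from the device query
                           smem_per_sm=49152,        # from the device query
                           warp_size=32):
    """Estimate how many blocks one multiprocessor can hold at once.

    Simplified model: each resource gives its own upper bound and the
    smallest bound wins. Real hardware also rounds register and shared
    memory allocations up to some granularity, which is ignored here.
    """
    # Threads are allocated in whole warps, so round the block size up.
    warps_per_block = -(-threads_per_block // warp_size)
    threads_allocated = warps_per_block * warp_size

    limits = [max_blocks_per_sm, max_threads_per_sm // threads_allocated]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * threads_allocated))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


if __name__ == "__main__":
    # Hypothetical kernels using 16 registers per thread, no shared memory.
    print(resident_blocks_per_sm(1024, 16, 0))  # one 1024-thread block fits
    print(resident_blocks_per_sm(512, 16, 0))   # three 512-thread blocks fit
```

Under these assumptions a single 1024-thread block leaves room that a second 1024-thread block cannot use, while 512-thread blocks can fill the multiprocessor with three blocks.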

(And I would encourage you to take another read through the CUDA Programming Guide. Chapters 1, 2, 4, 5 answer a lot of your questions.)
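On the original "Maximum number of threads per block" question: the 1024 figure limits the total number of threads in a block, that is the product x * y * z, while the "Maximum sizes of each dimension" line limits each dimension separately. A minimal sketch of that check, using the numbers from the device query above as defaults (the function name is hypothetical):

```python
def block_dim_is_valid(x, y, z, max_threads=1024, max_dims=(1024, 1024, 64)):
    """Check a block shape against the two limits from the device query.

    Both conditions must hold: every dimension is within its own
    maximum, and the total thread count x * y * z does not exceed
    the per-block thread limit.
    """
    if min(x, y, z) < 1:
        return False
    within_dims = x <= max_dims[0] and y <= max_dims[1] and z <= max_dims[2]
    return within_dims and x * y * z <= max_threads


if __name__ == "__main__":
    for shape in [(1024, 1, 1), (32, 32, 1), (32, 32, 2), (1, 1, 128)]:
        print(shape, block_dim_is_valid(*shape))
```

So 32 x 32 x 1 is fine (1024 threads), 32 x 32 x 2 is not (2048 threads), and 1 x 1 x 128 is not either, because z exceeds 64 even though the total is small.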
