Question about device query information, 48 cuda cores * 32, versus 1024 limitation. Maximum perform

Question about device query information:

( 1) Multiprocessors x (48) CUDA Cores/MP: 48 CUDA Cores
GPU Clock Speed: 1.62 GHz
Memory Clock rate: 600.00 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64

What do the following numbers mean:

  1. Maximum number of threads per block? Does this mean the BlockDim3 (xyz is limited to 1024)? Or does it mean there can be 1024 active threads executing at the same time (so the BlockDim3 can be much larger, except some of them will not execute yet/they are virtual)? (Perhaps the GPU can only store 1024 threads (thread execution data: instruction pointers/instruction registers) inside its memory, and thus it’s limited?) But still… can’t it then simply swap in new threads from the BlockDim3 when it’s done with the active threads? (Is this perhaps an implementation for the future?)

It says CUDA cores 48, and each CUDA core can execute a warp of 32, so that is 48*32 = 1536 threads in parallel. Why is there a limit of 1024? Is 50 percent being wasted?

If it’s being wasted on a single block, can the 50% wasted execute another block ?

(Maybe some CUDA cores were disabled in hardware, or they are brain dead (production flaws)? ;))

(Maybe driver/software limitations ?)

(Maybe hardware design mistake ? ;) :))

(Maybe 50% is used for “book-keeping” = helping other cuda cores run ?!? ;) :))

^ Wild crazy hypothesis :)

There is no thread-swapping on existing CUDA devices. Once a block is assigned to a multiprocessor, all of its threads are active and take ownership of resources (registers, shared memory, etc.) until the block terminates. The block cannot be suspended, only terminated. If you request more blocks than can run simultaneously, only some of the blocks will be scheduled for execution at first. As blocks run to completion (or sometimes slightly less frequently than that), new blocks are started to fill the available slots until all blocks are finished.
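
In practice this means the 1024 limit applies to the size of one block, not to the whole job: to cover more elements you launch more blocks and let the hardware start them as slots free up. Here is a minimal sketch of that pattern. The kernel `scale`, the helpers `blocksFor` and `scaleOnDevice`, and the block size of 256 are all made up for illustration; they are not from the deviceQuery output or any sample.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the last block may be only partly used
        data[i] *= factor;
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Copies hostData to the device, scales it with many more blocks than
// can be resident at once, and copies the result back.
cudaError_t scaleOnDevice(float* hostData, int n, float factor)
{
    const int threadsPerBlock = 256;   // assumption: any value <= "Maximum number of threads per block"
    float* devData = nullptr;

    cudaError_t err = cudaMalloc(&devData, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // The grid may hold thousands of blocks; only some are resident
        // at any moment, the rest start as earlier blocks finish.
        scale<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(devData, n, factor);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostData, devData, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(devData);
    return err;
}
```

Calling `scaleOnDevice(buffer, 1 << 20, 2.0f)` would launch 4096 blocks of 256 threads each, far more than a single multiprocessor can hold at one time.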

See my other response to one of your posts about how entire warps are not assigned to a single CUDA core. A single multiprocessor can execute more than one block at once, if sufficient register and shared memory resources are available.
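
If you want to see how many blocks can share a multiprocessor for a given kernel, you can ask the runtime rather than guess. The sketch below assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (it was added well after the device in the question shipped); `dummyKernel`, `residentWarps` and `printResidency` are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy query has something to inspect.
__global__ void dummyKernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Warps resident on one multiprocessor when blocksPerSM blocks of
// threadsPerBlock threads are active together.
int residentWarps(int blocksPerSM, int threadsPerBlock, int warpSize)
{
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    return blocksPerSM * warpsPerBlock;
}

// Prints the per-block limit and how many blocks of the given size
// the runtime expects to fit on one multiprocessor of device 0.
cudaError_t printResidency(int threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, threadsPerBlock, 0 /* no dynamic shared memory */);
    if (err != cudaSuccess) return err;

    printf("max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("max resident threads per MP:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("blocks of %d threads per MP:    %d\n", threadsPerBlock, blocksPerSM);
    printf("resident warps per MP:          %d\n",
           residentWarps(blocksPerSM, threadsPerBlock, prop.warpSize));
    return cudaSuccess;
}
```

As far as I recall, compute capability 2.x parts like the one in the question allow up to 1536 resident threads per multiprocessor, so for example three 512-thread blocks (48 warps) could be resident together even though a single block is capped at 1024 threads. Treat that figure as something to confirm against `maxThreadsPerMultiProcessor` on your own device.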

(And I would encourage you to take another read through the CUDA Programming Guide. Chapters 1, 2, 4, 5 answer a lot of your questions.)
