Question about device query information:
( 1) Multiprocessors x (48) CUDA Cores/MP: 48 CUDA Cores
GPU Clock Speed: 1.62 GHz
Memory Clock rate: 600.00 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
What do the following numbers mean?
- Maximum number of threads per block: does this mean the block dimensions are limited (blockDim.x * blockDim.y * blockDim.z <= 1024)? Or does it mean there can be at most 1024 active threads executing at the same time, so the block dimensions can be much larger, except some threads will not execute yet (they stay virtual)? (Perhaps the GPU can only store execution state, instruction pointers/registers, for 1024 threads in its memory, and is therefore limited?!) But still, can't it simply swap in new threads from the block when it's done with the active ones?! (Is this perhaps planned for a future implementation?)
It says 48 CUDA cores; if each CUDA core can execute a warp of 32 threads, that is 48 * 32 = 1536 threads in parallel. So why is there a limit of 1024? Is 50 percent being wasted?
If that capacity is wasted on a single block, can the "wasted" 50% execute another block instead?
(Maybe some CUDA cores were disabled in hardware, or are brain-dead due to production flaws?)
(Maybe a driver/software limitation?)
(Maybe a hardware design mistake?)
(Maybe 50% is used for "book-keeping", i.e. helping the other CUDA cores run?!)
^ Wild, crazy hypothesis :)