Relationship between threads and GPU cores/units

The Kepler GK110 white paper states that each SMX has:

  • 192 single-precision CUDA cores (SPCC)
  • 64 double-precision units (DPU)
  • 32 special function units (SFU)
  • 32 load/store units (LD/ST)

This gives a total of 192 + 64 + 32 + 32 = 320 units per SMX.

It also states that the maximum number of threads per multiprocessor (SMX) is 2048.

Question: How does the number of threads in each SMX (2048) relate to the number of SPCC, DPU, SFU, and LD/ST units?

You could relate it to the number of cores (not processes) and FPUs of a CPU:
i.e. a CPU has x cores, and therefore at most x processes can be active (running on the CPU) at any given time; each core also has access to y FPUs at any given time.

If you take the quoted numbers/specifications and convert them to warps, you get the maximum number of warps an SM can track at any time (arguably most important in terms of instruction pointers), and the maximum number of warps that can have access to the said functional units at any given time; see the worked conversion below.
In the above, “access” means issuing instructions; it does not explicitly consider the pipelines of the functional units.
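
To make that conversion concrete, here is a minimal sketch (plain C, so it compiles anywhere; reading the unit counts as “warps' worth of lanes” is my framing of the above, not NVIDIA's wording):

    #include <stdio.h>

    int main(void)
    {
        const int warp_size   = 32;    /* threads per warp on all CUDA GPUs to date */
        const int max_threads = 2048;  /* max resident threads per SMX (GK110)      */
        const int spcc = 192, dpu = 64, sfu = 32, ldst = 32;

        /* Resident warps: how many warps the SMX can track at once. */
        printf("resident warps:   %d\n", max_threads / warp_size);  /* 64 */

        /* Unit counts expressed as warps' worth of 32-wide lanes, i.e. how
           many warp-instructions of each type could in principle be issued
           per cycle if every lane were used. */
        printf("SP   warps/cycle: %d\n", spcc / warp_size);         /* 6 */
        printf("DP   warps/cycle: %d\n", dpu  / warp_size);         /* 2 */
        printf("SFU  warps/cycle: %d\n", sfu  / warp_size);         /* 1 */
        printf("LDST warps/cycle: %d\n", ldst / warp_size);         /* 1 */
        return 0;
    }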

Further thoughts:

(1) Assuming that the 192 SPCC and the 64 DPU are the only components in the SMX that perform computational work, each SMX has a total of 192 + 64 = 256 compute cores/units. Dividing the max threads per multiprocessor (i.e. 2048) by the 256 compute cores/units gives 8, which would mean each compute core/unit corresponds to 8 threads. Is this correct?

(2) The white paper on Kepler GK110 is here: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Referring to the diagram of the Streaming Multiprocessor (SMX) architecture on page 9, and based on the deduction above, does this mean that 1 warp (32 threads) actually operates 3 SPCC (i.e. 3 × 8 = 24 single-precision threads) and 1 DPU (i.e. 8 double-precision threads) at any moment? If this is true, and if the compute task only involves single-precision variables, does this mean that each warp is only 3/4 operationally effective, while 1/4 of the warp's threads remain idle? And would the converse be true if the compute task only involves double-precision variables?

I would appreciate help with these questions.

The SM can only issue so many instructions to warps via its schedulers at any time; one should consider this too in the context of the optimal number of functional units.
Elementary example: if the SM can only schedule 2 warps, 2 of a total count of 4 functional units may be redundant.
Thereafter, you need to reintroduce the pipelines of these functional units.

The common denominator is still warps: an SM operates on, and does transactions in terms of, warps.
This generally applies across the board; for example, the SM schedules warps, not threads.

“compute threads”
CPUs do not generally work with “compute threads”, nor do GPUs.
There is some parallelism in terms of SIMD on CPUs; still, SIMD does not imply compute threads, merely that functional units are happy to execute the same instruction on multiple data lanes, paths, or simply sources.

If you look at the assembler document (the PTX ISA), you will quickly note that a) threads are generally conceptual, a point the programming guide also raises, and b) the most important property of a thread is perhaps its instruction pointer.
Hence, you could perhaps think of threads as simply one of many instruction pointers into your (kernel) code, pointing to instructions eventually to be executed by physical functional units, on a warp basis. A small kernel illustrating this is sketched below.
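
A minimal CUDA sketch of that view (the kernel name and values are mine, purely illustrative): all 32 lanes of a warp share a single instruction stream, which becomes visible whenever a branch splits a warp.

    #include <cstdio>

    /* Each "thread" is essentially an id plus a position in the instruction
       stream; the hardware issues instructions per warp. */
    __global__ void warp_view(int *out)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int warpId = threadIdx.x / warpSize;   /* warpSize == 32 */
        int laneId = threadIdx.x % warpSize;

        /* Both sides of this branch are executed by the *warp*, with the
           non-participating lanes masked off: one instruction pointer,
           32 lanes of data. */
        if (laneId < 16)
            out[tid] = warpId * 100 + laneId;
        else
            out[tid] = -(warpId * 100 + laneId);
    }

    int main()
    {
        const int n = 64;  /* two warps */
        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));
        warp_view<<<1, n>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }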

Hi little-jimmy.

Thank you for replying.
Can you help me by answering my questions (1) and (2) plainly (and for the moment please disregard factors such as warp schedulers, dispatch, and the register file)?

You may disregard the word “compute” in the phrase “compute thread”. I used the word “compute” in connection with my assumption. I simply meant a thread the way NVIDIA describes it: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy.

Can you provide me with a link to the assembler document that you mentioned?

When you wrote “instruction pointer”, did you mean “register file”?

Based on your last paragraph, how would you then relate a warp to the single-precision CUDA cores (SPCC) and double-precision units (DPU) mentioned by NVIDIA? This brings us back to my question (2), which is about understanding the physical meaning of a warp and its relationship with the SPCC and DPU.

I have answered most of your questions in the Stack Overflow answer and comments for “How do CUDA blocks/warps/threads map onto CUDA cores?” at http://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores/10467342#10467342. I do not like that the answers are scattered, so I’ll try to give a brief answer here.

An SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots), and execution units. The SMX also contains shared execution units such as the texture unit, the shared memory unit, and the double-precision units. A rough structural sketch follows.
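
As a purely illustrative model of that layout (the struct names and field groupings are mine, not NVIDIA's):

    /* Illustrative model of the description above -- not a hardware
       specification; names and groupings are assumptions for exposition. */
    struct Subpartition {
        void *warp_scheduler;   /* one warp scheduler per subpartition */
        void *register_file;    /* per-subpartition register resources */
        void *scheduler_slots;  /* slots for the warps resident here   */
        void *execution_units;  /* e.g. single-precision/integer units */
    };

    struct SMX {
        struct Subpartition sub[4]; /* 4 subpartitions per SMX */
        void *texture_unit;         /* shared across the SMX   */
        void *shared_memory_unit;   /* shared across the SMX   */
        void *double_precision;     /* shared across the SMX   */
    };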

The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition, and warp resources such as registers are allocated. A warp stays on that specific subpartition until it completes; when it completes, its resources are freed.
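
For instance, the way a block decomposes into warps is simple round-up division by the warp size (the rule is from the CUDA programming guide; the helper name is mine):

    /* Sketch: how a thread block decomposes into warps. */
    int warps_per_block(int threads_per_block)
    {
        const int warp_size = 32;
        return (threads_per_block + warp_size - 1) / warp_size;  /* round up */
    }

    /* e.g. warps_per_block(256) == 8; with the GK110 limit of 2048 resident
       threads (64 warps) per SMX, at most 8 such blocks can be resident,
       resource limits (registers, shared memory) permitting. */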

Each cycle, each warp scheduler will pick an eligible warp (one that is not stalled) and issue 1 or 2 instructions from that warp. These instructions are dispatched to execution units (single-precision/integer unit, double-precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can execute instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas SMX shared units such as the double-precision unit, and memory units such as shared memory and the texture unit, have variable latency. A sketch of such a microbenchmark follows.
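
As a hedged sketch of the kind of microbenchmark that exposes fixed ALU latency (the kernel name, constants, and launch shape are mine): a chain of dependent operations forces each instruction to wait for the previous result, so elapsed clocks divided by iterations approximates the pipeline latency.

    /* Launch with one warp, e.g. alu_latency<<<1, 32>>>(d_out, d_cycles, 10000).
       A single dependent FMA chain has no instruction-level parallelism, so
       each FMA waits out the full pipeline latency. */
    __global__ void alu_latency(float *out, long long *cycles, int iters)
    {
        float x = out[0];
        long long start = clock64();
        for (int i = 0; i < iters; ++i)
            x = x * 1.000001f + 1.0f;   /* next FMA depends on this result */
        long long stop = clock64();

        if (threadIdx.x == 0) {
            *cycles = stop - start;  /* ~ iters * FMA latency + loop overhead  */
            out[0]  = x;             /* keep the chain from being optimized away */
        }
    }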

The reason the SMX can manage 2048 threads = 64 warps is so that each warp scheduler has a sufficient pool of warps to hide the latency of both long- and short-latency instructions, without adding the area and power cost of out-of-order execution.
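
To put rough numbers on that (an illustrative application of Little's law; the latency figure is an assumption for the sake of example, not a GK110 specification):

    #include <stdio.h>

    int main(void)
    {
        const int schedulers  = 4;    /* warp schedulers per SMX               */
        const int alu_latency = 9;    /* assumed ALU pipeline latency (cycles) */
        const int smx_warps   = 64;   /* 2048 threads / 32                     */

        /* Little's law: to issue one instruction per scheduler per cycle while
           each result takes 'alu_latency' cycles, each scheduler needs about
           'alu_latency' eligible warps. */
        int needed = schedulers * alu_latency;
        printf("warps needed to cover ALU latency: %d\n", needed);    /* 36 */
        printf("warps available per SMX:           %d\n", smx_warps); /* 64 */

        /* 64 > 36, so there is headroom; longer (memory) latencies also rely
           on instruction-level parallelism and multiple outstanding loads. */
        return 0;
    }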