Relationship between threads and GPU cores/units

The Kepler GK110 white paper states that each SMX has:

  • 192 single-precision CUDA cores (SPCC)
  • 64 double-precision units (DPU)
  • 32 special function units (SFU)
  • 32 load/store units (LD/ST)

This gives a total of 192 + 64 + 32 + 32 = 320 units per SMX.

It also states that the maximum number of threads per multiprocessor (SMX) is 2048.

Question: How does the number of threads in each SMX (2048) relate to the number of SPCC, DPU, SFU, and LD/ST units?

You could relate it to the number of cores (not processes) and FPUs of a CPU:
i.e. a CPU has x cores, and therefore at most x processes can be active (running on the CPU) at any given time; each core also has access to y FPUs at any given time.

If you take the quoted numbers/specifications and convert them to warps, you get the maximum number of warps an SM can track at any time (arguably most important in terms of instruction pointers), and the maximum number of warps that can have access to the said functional units at any given time; see the worked conversion below.
In the above, “access” means issuing instructions; it does not explicitly consider the pipelines of the functional units.
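
To make that conversion concrete, here is a minimal sketch (plain C, so it compiles anywhere; reading the unit counts as “warps' worth of lanes” is my framing of the above, not NVIDIA's wording):

    #include <stdio.h>

    int main(void)
    {
        const int warp_size   = 32;    /* threads per warp on all CUDA GPUs to date */
        const int max_threads = 2048;  /* max resident threads per SMX (GK110)      */
        const int spcc = 192, dpu = 64, sfu = 32, ldst = 32;

        /* Resident warps: how many warps the SMX can track at once. */
        printf("resident warps:   %d\n", max_threads / warp_size);  /* 64 */

        /* Unit counts expressed as warps' worth of 32-wide lanes, i.e. how
           many warp-instructions of each type could in principle be issued
           per cycle if every lane were used. */
        printf("SP   warps/cycle: %d\n", spcc / warp_size);         /* 6 */
        printf("DP   warps/cycle: %d\n", dpu  / warp_size);         /* 2 */
        printf("SFU  warps/cycle: %d\n", sfu  / warp_size);         /* 1 */
        printf("LDST warps/cycle: %d\n", ldst / warp_size);         /* 1 */
        return 0;
    }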

Further thoughts:

(1) Assuming that the 192 SPCC and the 64 DPU are the only components in the SMX that perform computational work, each SMX has a total of 192 + 64 = 256 compute cores/units. Dividing the max threads per multiprocessor (i.e. 2048) by the 256 compute cores/units gives 8, which would mean each compute core/unit corresponds to 8 threads. Is this correct?

(2) The white paper on Kepler GK110 is here: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. Referring to the diagram of the Streaming Multiprocessor (SMX) architecture on page 9, and based on the deduction above, does this mean that 1 warp (32 threads) actually operates 3 SPCC (i.e. 3 × 8 = 24 single-precision threads) and 1 DPU (i.e. 8 double-precision threads) at any moment? If this is true, and if the compute task only involves single-precision variables, does this mean that each warp is only 3/4 operationally effective, while 1/4 of the warp's threads remain idle? And would the converse be true if the compute task only involves double-precision variables?

I would appreciate help with these questions.

The SM can only issue so many instructions to warps via its schedulers at any time; one should consider this too in the context of the optimal number of functional units.
Elementary example: if the SM can only schedule 2 warps, 2 of a total count of 4 functional units may be redundant.
Thereafter, you need to reintroduce the pipelines of these functional units.

The common denominator is still warps: an SM operates on, and does transactions in terms of, warps.
This generally applies across the board; for example, the SM schedules warps, not threads.

“compute threads”
CPUs do not generally work with “compute threads”, nor do GPUs.
There is some parallelism in terms of SIMD on CPUs; still, SIMD does not imply compute threads, merely that functional units are happy to execute the same instruction on multiple data lanes, paths, or simply sources.

If you look at the assembler document (the PTX ISA), you will quickly note that a) threads are generally conceptual, a point the programming guide also raises, and b) the most important property of a thread is perhaps its instruction pointer.
Hence, you could perhaps think of threads as simply one of many instruction pointers into your (kernel) code, pointing to instructions eventually to be executed by physical functional units, on a warp basis. A small kernel illustrating this is sketched below.
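
A minimal CUDA sketch of that view (the kernel name and values are mine, purely illustrative): all 32 lanes of a warp share a single instruction stream, which becomes visible whenever a branch splits a warp.

    #include <cstdio>

    /* Each "thread" is essentially an id plus a position in the instruction
       stream; the hardware issues instructions per warp. */
    __global__ void warp_view(int *out)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int warpId = threadIdx.x / warpSize;   /* warpSize == 32 */
        int laneId = threadIdx.x % warpSize;

        /* Both sides of this branch are executed by the *warp*, with the
           non-participating lanes masked off: one instruction pointer,
           32 lanes of data. */
        if (laneId < 16)
            out[tid] = warpId * 100 + laneId;
        else
            out[tid] = -(warpId * 100 + laneId);
    }

    int main()
    {
        const int n = 64;  /* two warps */
        int *d_out;
        cudaMalloc(&d_out, n * sizeof(int));
        warp_view<<<1, n>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }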

Hi little-jimmy.

Thank you for replying.
Can you help me by answering my questions (1) and (2) plainly (and for the moment please disregard factors such as warp schedulers, dispatch, and the register file)?

You may disregard the word “compute” in the phrase “compute thread”. I used the word “compute” in connection with my assumption. I simply meant a thread the way NVIDIA describes it: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy.

Can you provide me with a link to the assembler document that you mentioned?

When you wrote “instruction pointer”, did you mean “register file”?

Based on your last paragraph, how would you then relate a warp to the single-precision CUDA cores (SPCC) and double-precision units (DPU) mentioned by NVIDIA? This brings us back to my question (2), which is about understanding the physical meaning of a warp and its relationship with the SPCC and DPU.

I have answered most of your questions in the Stack Overflow answer and comments for “How do CUDA blocks/warps/threads map onto CUDA cores?” at http://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores/10467342#10467342. I do not like that the answers are scattered, so I’ll try to give a brief answer here.

An SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots), and execution units. The SMX also contains shared execution units such as the texture unit, the shared memory unit, and the double-precision units. A rough structural sketch follows.
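
As a purely illustrative model of that layout (the struct names and field groupings are mine, not NVIDIA's):

    /* Illustrative model of the description above -- not a hardware
       specification; names and groupings are assumptions for exposition. */
    struct Subpartition {
        void *warp_scheduler;   /* one warp scheduler per subpartition */
        void *register_file;    /* per-subpartition register resources */
        void *scheduler_slots;  /* slots for the warps resident here   */
        void *execution_units;  /* e.g. single-precision/integer units */
    };

    struct SMX {
        struct Subpartition sub[4]; /* 4 subpartitions per SMX */
        void *texture_unit;         /* shared across the SMX   */
        void *shared_memory_unit;   /* shared across the SMX   */
        void *double_precision;     /* shared across the SMX   */
    };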

The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition, and warp resources such as registers are allocated. A warp stays on that specific subpartition until it completes; when it completes, its resources are freed.
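
For instance, the way a block decomposes into warps is simple round-up division by the warp size (the rule is from the CUDA programming guide; the helper name is mine):

    /* Sketch: how a thread block decomposes into warps. */
    int warps_per_block(int threads_per_block)
    {
        const int warp_size = 32;
        return (threads_per_block + warp_size - 1) / warp_size;  /* round up */
    }

    /* e.g. warps_per_block(256) == 8; with the GK110 limit of 2048 resident
       threads (64 warps) per SMX, at most 8 such blocks can be resident,
       resource limits (registers, shared memory) permitting. */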

Each cycle, each warp scheduler will pick an eligible warp (one that is not stalled) and issue 1 or 2 instructions from that warp. These instructions are dispatched to execution units (single-precision/integer unit, double-precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can execute instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas SMX shared units such as the double-precision unit, and memory units such as shared memory and the texture unit, have variable latency. A sketch of such a microbenchmark follows.
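
As a hedged sketch of the kind of microbenchmark that exposes fixed ALU latency (the kernel name, constants, and launch shape are mine): a chain of dependent operations forces each instruction to wait for the previous result, so elapsed clocks divided by iterations approximates the pipeline latency.

    /* Launch with one warp, e.g. alu_latency<<<1, 32>>>(d_out, d_cycles, 10000).
       A single dependent FMA chain has no instruction-level parallelism, so
       each FMA waits out the full pipeline latency. */
    __global__ void alu_latency(float *out, long long *cycles, int iters)
    {
        float x = out[0];
        long long start = clock64();
        for (int i = 0; i < iters; ++i)
            x = x * 1.000001f + 1.0f;   /* next FMA depends on this result */
        long long stop = clock64();

        if (threadIdx.x == 0) {
            *cycles = stop - start;  /* ~ iters * FMA latency + loop overhead  */
            out[0]  = x;             /* keep the chain from being optimized away */
        }
    }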

The reason the SMX can manage 2048 threads = 64 warps is so that each warp scheduler has a sufficient pool of warps to hide the latency of both long- and short-latency instructions, without adding the area and power cost of out-of-order execution.
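
To put rough numbers on that (an illustrative application of Little's law; the latency figure is an assumption for the sake of example, not a GK110 specification):

    #include <stdio.h>

    int main(void)
    {
        const int schedulers  = 4;    /* warp schedulers per SMX               */
        const int alu_latency = 9;    /* assumed ALU pipeline latency (cycles) */
        const int smx_warps   = 64;   /* 2048 threads / 32                     */

        /* Little's law: to issue one instruction per scheduler per cycle while
           each result takes 'alu_latency' cycles, each scheduler needs about
           'alu_latency' eligible warps. */
        int needed = schedulers * alu_latency;
        printf("warps needed to cover ALU latency: %d\n", needed);    /* 36 */
        printf("warps available per SMX:           %d\n", smx_warps); /* 64 */

        /* 64 > 36, so there is headroom; longer (memory) latencies also rely
           on instruction-level parallelism and multiple outstanding loads. */
        return 0;
    }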