Instruction level parallelism

savage309 · June 8, 2015, 3:38pm

Hey,
I’ve read about the ILP capabilities of Kepler and Maxwell (Volkov’s talk for example http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).
But I can’t find any references for ILP in the documentation or the architecture white papers.

As far as I can see, each CUDA core has integer and float ALU (but only one of them can be active at a time) and can do (F)MAD, but that's it. ILP, however, would seem to need more than that: at least multiple ALUs in each core active at the same time. I can't find any hint of such hardware in those cores (and yet ILP obviously does happen on Kepler and Maxwell). FMAD is a kind of ILP, but is that all the ILP that is possible?

Any explanations regarding that would be greatly appreciated.

little_jimmy · June 9, 2015, 4:46am

“As far as I can see, each CUDA core has integer and float ALU (but only one of them can be active at a time)”

it sometimes does take effort to reconcile the more abstract, conceptual programming guide with the more hardware-oriented documents and constructs

from a purely hardware perspective, ‘core’ may be a misnomer
the Kepler architecture whitepaper, for example, describes an SM - the real processor or executor of code (SM viewed inclusive of its warps) - as:

“SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).”

i would thus not overly emphasize cores, but view a core more as a functional unit, and would describe an SM simply as having SP units, DP units, SFUs, LD/ST units, and warp schedulers - in line with the hardware diagrams found in said document

savage309 · June 9, 2015, 10:26am

Thanks for that answer.

However, the Volkov paper talks about ILP at the level of floating-point operations, meaning there should be multiple single-precision float ALUs in a single core (let's say that a core is one lane of those 192). I guess I'll try to contact him directly.

savage309 · June 9, 2015, 11:40am

Vasily Volkov:

On Maxwell you have 4 warp schedulers per multiprocessor (SM). Each can issue one instruction per cycle, or two if it supports dual-issue, as the whitepaper says but the CUDA programming guide does not. In the dual-issue case, both instructions come from the same warp.

Consider one of the warp schedulers. When an instruction is issued in cycle 0 from warp 0 on that scheduler, the next cycle's instruction may be issued from warp 1, then from warp 2, and so on. If there are lots of warps, ILP possibly has no effect at all, except in terms of dual-issue, i.e. in cycle X two instructions from warp Y are issued. Now suppose you run out of warps. Then what? You can issue from the same warp again, warp 0, but only if either the last instruction in that warp, which was issued in cycle 0, has already completed, or the next instruction in that warp doesn't depend on the previous instruction that is not done yet - which is ILP. This is orthogonal to instruction types, i.e. whether the instruction is SP/SFU/Load-Store, etc.
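The scheduling argument above can be tried out with a small toy model. This is only a sketch under stated assumptions, not a description of real hardware: one scheduler, one instruction issued per cycle (no dual-issue), a fixed latency, round-robin warp selection, and each warp's instruction k depending only on instruction k - ilp. The function name `cycles_to_issue` and all parameters are made up for illustration.

```python
def cycles_to_issue(num_warps, ilp, instrs_per_warp, latency):
    """Cycles a toy single-issue scheduler needs to issue every instruction.

    Assumptions (hypothetical model, not real hardware):
      - one instruction issued per cycle, no dual-issue
      - every instruction completes `latency` cycles after it is issued
      - in each warp, instruction k depends on instruction k - ilp,
        so `ilp` consecutive instructions are always independent
    """
    issue_cycle = [[] for _ in range(num_warps)]  # issue cycle of each instruction, per warp
    remaining = num_warps * instrs_per_warp
    cycle = 0
    start = 0  # round-robin pointer: first warp to try this cycle
    while remaining > 0:
        for offset in range(num_warps):
            w = (start + offset) % num_warps
            k = len(issue_cycle[w])  # index of this warp's next instruction
            if k >= instrs_per_warp:
                continue  # this warp has nothing left to issue
            # Ready if it has no producer, or its producer has completed.
            if k < ilp or issue_cycle[w][k - ilp] + latency <= cycle:
                issue_cycle[w].append(cycle)
                remaining -= 1
                start = (w + 1) % num_warps
                break
        cycle += 1  # a cycle passes whether or not anything was issued
    return cycle


one_warp_no_ilp = cycles_to_issue(num_warps=1, ilp=1, instrs_per_warp=64, latency=8)
one_warp_ilp8 = cycles_to_issue(num_warps=1, ilp=8, instrs_per_warp=64, latency=8)
many_warps_no_ilp = cycles_to_issue(num_warps=8, ilp=1, instrs_per_warp=64, latency=8)
many_warps_ilp4 = cycles_to_issue(num_warps=8, ilp=4, instrs_per_warp=64, latency=8)
```

In this model a single warp with no ILP stalls for the full latency between instructions, a single warp with ILP equal to the latency issues every cycle, and with 8 warps the scheduler is already saturated, so extra ILP changes nothing. The model deliberately leaves out any limit on how often the same warp can be issued from.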

A curious fact is that if you run only 1 warp and “all” instructions are independent (i.e. ILP is very large, in the dozens), you still don't get peak performance on any GPU except Maxwell. This means the warp schedulers can't issue from the same warp in back-to-back issue cycles. E.g. on G80/GT200 you can issue from the same warp only once every 8 cycles on each scheduler, whereas peak arithmetic throughput corresponds to a CPI of 4. On Fermi I believe the interval is 6 cycles, the peak CPI per scheduler is 2, and the latency is 18. So ILP above 3 is not effective, since at most 18 / 6 = 3 instructions from one warp can be in flight at once. On Maxwell, again, any ILP seems to be effective.
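The arithmetic behind those figures can be written out as a short sketch. It assumes the numbers quoted above (which are themselves stated as beliefs, not documented values) and two simple relations that are an interpretation of the post rather than anything from NVIDIA documentation: useful ILP is capped at latency divided by the same-warp issue interval, and a single warp reaches at most peak CPI divided by that interval of peak throughput. The function names are hypothetical.

```python
from fractions import Fraction


def max_useful_ilp(latency_cycles, reissue_interval_cycles):
    """Most instructions one warp can have in flight at once.

    A new instruction from the same warp can start only every
    `reissue_interval_cycles`, and each one is done after `latency_cycles`.
    """
    return Fraction(latency_cycles, reissue_interval_cycles)


def single_warp_fraction_of_peak(peak_cpi, reissue_interval_cycles):
    """Share of peak issue rate one warp can reach, however large its ILP."""
    return Fraction(peak_cpi, reissue_interval_cycles)


# Figures as quoted in the post above (assumed, not verified here).
fermi_ilp_cap = max_useful_ilp(latency_cycles=18, reissue_interval_cycles=6)
fermi_single_warp = single_warp_fraction_of_peak(peak_cpi=2, reissue_interval_cycles=6)
g80_single_warp = single_warp_fraction_of_peak(peak_cpi=4, reissue_interval_cycles=8)

print(fermi_ilp_cap, fermi_single_warp, g80_single_warp)
```

Under these assumptions a lone warp on G80/GT200 tops out at half of peak and on Fermi at a third, which matches the observation that one warp alone cannot reach peak on those GPUs.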

This is from Vasily and I think it makes it pretty clear.

little_jimmy · June 9, 2015, 12:02pm

192 / 32 = 6 warps

6 warps can ‘phone in’ some int/SP arithmetic at the same time
a warp scheduler can dispatch 2 instructions per warp if i am not mistaken, giving you a total of 8 across the 4 schedulers
you also have to consider secondary but compulsory instructions to ‘set up’ any arithmetic instruction/execution - the MOV class of instructions; you need to move values in and out of units
once a warp has phoned in work, it waits - it moves to the background; other warps then move to the foreground
you have to consider both the unit pipelines and the rates at which they can accept work/instructions; the 2 measures differ
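The 192 / 32 = 6 division can be repeated for the other unit counts in the Kepler whitepaper quote earlier in the thread. The snippet below is only a sketch of that arithmetic: it assumes a warp size of 32 and treats each group of 32 units as one warp's worth of lanes, which says nothing about pipeline depth or actual issue rates. The dictionary keys and the function name are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp

# Unit counts per SMX, taken from the whitepaper quote earlier in the thread.
SMX_UNITS = {
    "sp_cuda_cores": 192,
    "dp_units": 64,
    "sfu": 32,
    "ld_st": 32,
}


def warps_worth_of_lanes(units, warp_size=WARP_SIZE):
    """Number of full warps each unit type could serve side by side."""
    return {name: count // warp_size for name, count in units.items()}


lanes = warps_worth_of_lanes(SMX_UNITS)
for name, warps in lanes.items():
    print(f"{name}: {warps} warp(s) wide")
```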
