Instruction-level parallelism

Hey,
I’ve read about the ILP capabilities of Kepler and Maxwell (Volkov’s talk, for example: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).
But I can’t find any references for ILP in the documentation or the architecture white papers.

As far as I can see, each CUDA core has integer and float ALU (but only one of them can be active at a time) and can do (F)MAD, but that’s it. ILP, however, would need more than that: at the very least, multiple ALUs in each core would have to be active at the same time, but I can’t find any hint of such hardware in those cores (and yet ILP is obviously happening on Kepler and Maxwell). FMAD is a kind of ILP, but is that all the ILP that is possible?

Any explanations regarding that would be greatly appreciated.

“As far as I can see, each CUDA core has integer and float ALU (but only one of them can be active at a time)”

it does sometimes take effort to reconcile the more abstract, conceptual programming guide with the more hardware-oriented documents and constructs

from a purely hardware perspective, ‘core’ may be a misnomer
the kepler architecture whitepaper, for example, describes an sm - the real processor or executor of code (the sm viewed as including its warps) - as:

“SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).”

i would thus not put too much emphasis on cores, but view a core more as a functional unit, and would describe an sm simply as having sp units, dp units, ld/st units, and warp schedulers - in line with the hardware diagrams found in said document

Thanks for that answer.

However, Volkov’s talk is about ILP at the level of floating-point operations, meaning there should be multiple single-precision float ALUs in a single core (let’s say that a core is one lane of those 192). I guess I’ll try to contact him directly.
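
To make concrete what I mean by ILP at the floating-point level, here is a minimal sketch as I understand it from the talk. It is plain C++ standing in for the work of a single CUDA thread (in a real kernel these would be device functions), and the names chain_ilp1 and chain_ilp2 are made up for illustration. The first version is one dependency chain of multiply-adds; the second interleaves two chains that do not depend on each other, which is the extra per-thread parallelism in question.

```cpp
#include <cassert>

// One dependency chain: each multiply-add needs the previous result,
// so the thread has nothing independent to issue while it waits.
float chain_ilp1(float a, float b, float c, int iters) {
    assert(iters >= 0);
    for (int i = 0; i < iters; ++i) {
        a = a * b + c;
    }
    return a;
}

// Two independent chains in the same thread: the multiply-add on 'a'
// and the one on 'd' do not depend on each other, so the second can be
// issued without waiting for the first to complete.
float chain_ilp2(float a, float d, float b, float c, int iters) {
    assert(iters >= 0);
    for (int i = 0; i < iters; ++i) {
        a = a * b + c;  // chain 1
        d = d * b + c;  // chain 2, independent of chain 1
    }
    return a + d;
}
```

Both versions compute the same values per chain; the only difference is how many independent instructions one thread exposes at a time. Whether that needs several ALUs per core, or just one pipelined unit that accepts a new instruction before the previous one has finished, is exactly what I am unsure about.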

This is from Vasily and I think it makes it pretty clear.

192 sp units / 32 threads per warp = 6 warps

so 6 warps can ‘phone in’ some int/sp arithmetic at the same time
a kepler sm has 4 warp schedulers, and each can dispatch 2 instructions per warp per cycle if i am not mistaken, giving you a total of 8
you also have to consider the secondary but compulsory instructions needed to ‘set up’ any arithmetic instruction - the MOV class of instructions; you need to move values in and out of the units
once a warp has phoned in work, it waits - it moves to the background; other warps then move to the foreground
you have to consider both the unit pipelines (how long an instruction takes to come out) and the rates at which the units can accept work/instructions; the 2 measures differ
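
to put numbers on the above, here is a rough sketch of the arithmetic in plain c++. the 192 comes from the whitepaper quote earlier; the 4 schedulers with 2 dispatch units each are my reading of the same whitepaper and are an assumption as far as this thread goes. all constant and function names are made up for illustration, and the last function is just the usual latency times rate estimate, with the latency left as a parameter because i do not want to quote a figure i am not sure of.

```cpp
#include <cassert>

// per-sm figures. the sp unit count is from the whitepaper quote above;
// the scheduler and dispatch counts are assumptions taken from my reading
// of the same whitepaper, not something quoted in this thread.
constexpr int kSpUnitsPerSm = 192;
constexpr int kThreadsPerWarp = 32;
constexpr int kWarpSchedulers = 4;
constexpr int kDispatchPerScheduler = 2;

// warp-wide instructions needed per cycle to occupy every sp unit
constexpr int warp_instructions_to_fill_sp_units() {
    return kSpUnitsPerSm / kThreadsPerWarp;
}

// upper bound on instructions dispatched per cycle by all schedulers
constexpr int max_instructions_dispatched_per_cycle() {
    return kWarpSchedulers * kDispatchPerScheduler;
}

// instructions that must come from a second, independent instruction of
// an already selected warp if every sp unit is to get work in a cycle
constexpr int instructions_needed_from_second_dispatch() {
    return warp_instructions_to_fill_sp_units() - kWarpSchedulers;
}

// latency times rate: independent instructions that must be in flight
// to keep a pipeline full. latency_cycles is a placeholder the caller
// supplies; no real hardware latency is assumed here.
int instructions_in_flight(int latency_cycles, int accepted_per_cycle) {
    assert(latency_cycles >= 0 && accepted_per_cycle >= 0);
    return latency_cycles * accepted_per_cycle;
}
```

under those assumptions the first three come out as 6, 8 and 2: one instruction from each of 4 selected warps covers only 4 of the 6 warp-wide slots, so the remaining 2 have to be second, independent instructions from the same warps, which is where ilp comes in without needing extra alus inside a core.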