I have been reading up on ILP for Maxwell, and since an SMM has 4 warp schedulers, I was wondering if you could please clarify: since each warp scheduler can issue 2 instructions per warp (if they are independent), is it possible to use all 4 warp schedulers to schedule 8 independent instructions from a single warp, resulting in 256 instructions being executed concurrently?
No. Each warp is assigned to exactly one scheduler.
Oh ok, thanks. So what does it mean when people say they get 4x ILP or 8x ILP? Technically, can’t you only get 2x ILP?
I don’t know. Do you have any reference to such a claim?
Ah I see. The presentation refers to the number of consecutive independent instructions that can potentially be executed in parallel.
Only two instructions can be issued per cycle per warp. However, having more than two consecutive independent instructions still helps, because each instruction also has a latency until its results become available. This latency is usually covered by issuing instructions from other warps. But by having multiple independent instructions lined up within the instruction stream of each warp, fewer warps (= lower occupancy) are needed to fully load the device.
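To make that concrete, here is a hypothetical kernel sketch (not from this thread) with four independent accumulators per thread. The four adds in the loop body have no mutual dependencies, so the hardware can have all of them in flight at once rather than stalling on each result; the `ilp4` name, the 4-way unrolling, and the simplified loop bound (tail elements ignored) are all assumptions for illustration:

```cuda
// Sketch: 4-way ILP via independent accumulators in one thread.
__global__ void ilp4(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators: no add depends on another's result.
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int j = i; j + 3 * stride < n; j += 4 * stride) {
        a0 += x[j];               // these four adds are mutually
        a1 += x[j + stride];      // independent, so their latencies
        a2 += x[j + 2 * stride];  // overlap instead of serializing
        a3 += x[j + 3 * stride];
    }
    y[i] = a0 + a1 + a2 + a3;  // dependencies only at the very end
}
```

With a single accumulator, each add would have to wait for the previous one; with four, the same warp can keep issuing, so fewer resident warps are needed to hide the arithmetic latency.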
Ohhh that makes way more sense now! Thanks a lot! :)
The dual issue is per warp: each scheduler can issue up to two instructions from one warp. So on architectures that have four warp schedulers per SM, up to eight instructions can be issued per cycle per SM.
Fewer warps per SM indeed means lower occupancy (occupancy is just the number of resident warps divided by the maximum number of resident warps).
However, there is no one-to-one correspondence between threads and CUDA cores. “Core” in this context is a bit of a misnomer, as it actually just denotes a floating-point unit (FPU). Each FPU wants to be fed an instruction every cycle, which may come from any of the warps of its associated warp scheduler.
Also, context switches are free. Or, more precisely, there are no context switches at all: unlike on conventional CPUs, no register content is swapped out. Instead, the register file is large enough to hold the registers of all resident warps at the same time. This also implies that context switches don’t change the number of resident warps - they just pick one of the resident warps to issue from.
The advantage of more ILP with lower occupancy is that more resources (registers, shared memory) are available per warp, which can sometimes be put to good use to reduce the amount of communication necessary between warps.
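As a hypothetical sketch of that trade-off (the kernel name, the 4-element tile, and the simplified bounds check are my own assumptions): each thread below keeps a small tile of values entirely in registers. A kernel like this uses more registers per thread, which lowers occupancy, but the tile never has to round-trip through shared memory, and the four updates are mutually independent, which is exactly the extra ILP discussed above:

```cuda
// Sketch: register blocking - a 4-element tile per thread held in
// registers, trading occupancy for per-warp resources and ILP.
__global__ void scale4(const float *x, float *y, float s, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {          // simplified: tail elements ignored
        float r0 = x[base + 0];  // the whole tile lives in registers;
        float r1 = x[base + 1];  // no shared memory and no
        float r2 = x[base + 2];  // inter-warp communication needed
        float r3 = x[base + 3];
        y[base + 0] = r0 * s;    // four independent multiplies -
        y[base + 1] = r1 * s;    // more ILP within this one warp
        y[base + 2] = r2 * s;
        y[base + 3] = r3 * s;
    }
}
```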
Ah I see! When you say each FPU “has to be fed an instruction” every cycle, is that not an inefficiency if the kernel can’t supply that? I.e., 32 threads that add 1 to a float would only require 32 FPUs, but if I had 512 CUDA cores then all of them would execute that cycle, resulting in some power inefficiency?
Yes. “Each FPU wants to be fed an instruction per cycle” would have been a better description. FPUs that are not fed an instruction consume less power.