I have been reading up on ILP for Maxwell, and since an SMM has 4 warp schedulers, I was wondering if you could please clarify: since each warp scheduler can issue 2 instructions per warp (if they are independent), is it possible to use all 4 warp schedulers to schedule 8 independent instructions from a single warp, resulting in 256 instructions being executed concurrently?
No. Each warp is assigned to exactly one scheduler.
Oh ok, thanks. So what does it mean when people say they get 4x ILP or 8x ILP? Technically, can’t you only get 2x ILP?
I don’t know. Do you have any reference to such a claim?
Ah I see. The presentation refers to the number of consecutive independent instructions that can potentially be executed in parallel.
Only two instructions can be issued per cycle per warp. However, having more than two consecutive independent instructions still helps, because each instruction also has a latency until its results become available. This latency is usually covered by issuing instructions from other warps. But by having multiple independent instructions lined up within the instruction stream of each warp, fewer warps (= lower occupancy) are needed to fully load the device.
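To make that concrete, here is a hypothetical kernel sketch (not from this thread) with four independent accumulators per thread. The four adds in the loop body have no mutual dependencies, so the hardware can have all of them in flight at once rather than stalling on each result; the `ilp4` name, the 4-way unrolling, and the simplified loop bound (tail elements ignored) are all assumptions for illustration:

```cuda
// Sketch: 4-way ILP via independent accumulators in one thread.
__global__ void ilp4(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators: no add depends on another's result.
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int j = i; j + 3 * stride < n; j += 4 * stride) {
        a0 += x[j];               // these four adds are mutually
        a1 += x[j + stride];      // independent, so their latencies
        a2 += x[j + 2 * stride];  // overlap instead of serializing
        a3 += x[j + 3 * stride];
    }
    y[i] = a0 + a1 + a2 + a3;  // dependencies only at the very end
}
```

With a single accumulator, each add would have to wait for the previous one; with four, the same warp can keep issuing, so fewer resident warps are needed to hide the arithmetic latency.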
Ohhh that makes way more sense now! Thanks a lot! :)
The dual issue is per warp: each scheduler can issue up to two instructions from one warp. So on architectures that have four warp schedulers per SM, up to eight instructions can be issued per cycle per SM.
Fewer warps per SM indeed means lower occupancy (occupancy is just the number of resident warps divided by the maximum number of resident warps).
However, there is no one-to-one correspondence between threads and CUDA cores. “Core” in this context is a bit of a misnomer, as it actually just denotes a floating-point unit (FPU). Each FPU wants to be fed an instruction every cycle, which may come from any of the warps of its associated warp scheduler.
Also, context switches are free. Or, more precisely, there are no context switches at all: unlike on conventional CPUs, no register content is swapped out. Instead, the register file is large enough to hold the registers of all resident warps at the same time. This also implies that context switches don’t change the number of resident warps - they just pick one of the resident warps to issue from.
The advantage of more ILP with lower occupancy is that more resources (registers, shared memory) are available per warp, which can sometimes be put to good use to reduce the amount of communication necessary between warps.
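As a hypothetical sketch of that trade-off (the kernel name, the 4-element tile, and the simplified bounds check are my own assumptions): each thread below keeps a small tile of values entirely in registers. A kernel like this uses more registers per thread, which lowers occupancy, but the tile never has to round-trip through shared memory, and the four updates are mutually independent, which is exactly the extra ILP discussed above:

```cuda
// Sketch: register blocking - a 4-element tile per thread held in
// registers, trading occupancy for per-warp resources and ILP.
__global__ void scale4(const float *x, float *y, float s, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {          // simplified: tail elements ignored
        float r0 = x[base + 0];  // the whole tile lives in registers;
        float r1 = x[base + 1];  // no shared memory and no
        float r2 = x[base + 2];  // inter-warp communication needed
        float r3 = x[base + 3];
        y[base + 0] = r0 * s;    // four independent multiplies -
        y[base + 1] = r1 * s;    // more ILP within this one warp
        y[base + 2] = r2 * s;
        y[base + 3] = r3 * s;
    }
}
```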
Ah I see! When you say each FPU “has to be fed an instruction” every cycle, is that not an inefficiency if the kernel can’t supply that? I.e., 32 threads that add 1 to a float would only require 32 FPUs, but if I had 512 CUDA cores then all of them would execute that cycle, resulting in some power inefficiency?
Yes. “Each FPU wants to be fed an instruction per cycle” would have been a better description. FPUs that are not fed an instruction consume less power.