Two dispatch units in the Kepler.. is it possible execute two instructions in a warp at the same tim

Hi

I would like to ask a question about Kepler’s architecture related with “Branch divergence”. I was reading the features of the kepler, and saw that it has two dispatch units for warp scheduler. It means that each warp scheduler can dispatch two instructions at the same time… and obviously execute those at the same time.
So, in these new warp scheduler, branch divergence will continue?
My doubt is because if now it’s possible that one warp can execute two instructions at the same time. This could execute the two paths of a conditional if.

For example, a code:
instA;
if (…)
instB;
else
instC;

Before, the threads of a warp execute something like this:

1 2 3 4 … (32)
A A A A …
B - B - …

  • C - C …

But, now in the kepler. Is it possible the following?

1 2 3 4 … (32)
A A A A …
B C B C …

Thanks… for your answer =)

Please… somebody could help me?

Or where can i find about that?

As far as I know, no NVIDIA GPU can combine the execution of partial warps, though I wish. Conceptually, I think this is quite simple to do (just find all threads that are at the same program location and issue them to the SIMD lanes), but it seems that would significantly complicate the hardware scheduler, and you would also have to give up the concept of a warp - there would then only be threads.

Kepler has 4 warp schedulers per SM, and each can issue 2 back to back instructions if they’re independent.

You’re not the only one who wants more performance from diverged codes. You can read about Andrew Glew’s proposals to improve utilization http://docs.google.com/fileview?id=0B5qTWL2s3LcQNGE3NWI4NzQtNTBhNS00YjgyLTljZGMtNTA0YjJmMGIzNDEw&hl=en

Bill Dally’s group also proposed temporal SIMD, which would completely remove the penalty of diverged codes, but to me, it just seems to be a mirage because you don’t increase parallelism - you just execute an N wide instruction over N cycles on a single arithmetic unit, and only reduce instruction fetch costs.

Thanks…

but there is a thing that i didn’t understand. If kepler has 2 dispatch units and issue 2 instructions per warp scheduler. How does it divide the warp?.. it means the 32 threads. For a example above, would be 16 threads execute instB and 16 threads execute instC?. Or is that not possible?

ex:
thread 1 2 3 4 … (32)
inst A A A A …
inst B C B C …

If it’s not possible… how the kepler would execute the code of the last example? Or how the kepler take advantage of those 2 dispatch units?

Somebody has any idea? Please1!

Two dispatch units mean dual issue. That’s all. And dual issue is about issuing two subsequent instructions from the same warp.

In fact, dual issue might work in your case if the branch is compiled into explicit predication masks.
For example, a code

if( threadIdx.x == 0 )
    x = p[0];
else 
    x = x + 1;

compiles into

S2R R3, SR_Tid_X;                //R3 = threadIdx.x;
ISETP.EQ.AND P0, pt, R3, RZ, pt; //P0 = R3==0;
@P0 LD R2, [R0];                 //if( P0) x = p[0];
@!P0 MOV R4, c [0x0] [0x144];    //if(!P0) load x;
@!P0 IADD R2, R4, 0x1;           //if(!P0) x = x + 1;

See the annotation on right if you are new to native ISA. (I had x passed as kernel argument. Sorry if it is confusing.)

Now, the instruction prefixed @P0 and the first instruction prefixed @!P0 can, possibly, be issued together. Which is sort of what you are talking about. But not exactly. In the first instructions some threads will be off. Other threads will be off in the second instructions. They won’t be combined.

Thank so much vvolkov…

So… every thread of a warp has to execute the same instruction, also in the kepler???

I thought that if the Kepler had two dispatch units, also it can execute two instructions at the same time. In that case, I didn’t understand how it was done?.. So I thought that some threads of that warp execute instructionB and the others threads of that warp execute instructionC.

But if it’s not possible… what is the advantagee of having two dispatch units?

Please… only to be sure.

For the basic architecture you may want to read the Anandtech articles:

http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3
http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

and maybe others from the same place.

I know it’s a pity that Nvidia doesn’t document these things themselves but prefer to just tell them to journalists.

Thanks… but these articles don’t have the explanation about those features.

could somebody help me to answer my last questions?

yes, every thread of a warp has to execute the same instruction even on Kepler. Only this way the control logic can be reduced to 1/32th (having it just once for the whole warp).

Having multiple dispatch units allows you to execute multiple warps in parallel. The effect is similar to just having more multiprocessors, but with the added benefit that more warps are still able to communicate through shared memory.

Although Nvidia has recently only used it to effectively reduce the amount of shared memory (by sharing it among more warps) without breaking compatibility (the size per multiprocessor is still the same).

It should be able to issue two instructions, each with a different predicate, on a single cycle. Of course, this means it’s not issuing two instructions from the same predicate at the same time. Thus, you still get your total execution time divided between the two code paths.

Roughly speaking, it reads in two instructions at a time, checks if the second instruction uses the result from the first one, and executes both at the same time if it can. i doubt it matters whether the two instructions have the same lanes active. Hence, the two instructions could either be from the same side of an if statement or from separate sides.

To illustrate how this works, think of this bit of code:

result = (a+b)*(c+d);

This would boil down to somethink kinda like this:

ab = a+b
cd = c+d
result = ab*cd

Which would correspond to individual instructions.

Notice that a+b and c+d are completely independant from each other, and so can be executed at the same time.

Just for reference, here’s the NVidia documentation on the Kepler schedulers, from the GK110 whitepaper.

An interesting an undocumented question is about the limitations on dual-issue. Obviously if the second instruction depends on the output of the first, the dependencies prevent the second issue.

But I suspect (just from logic, not from knowledge) that the meaning of “independent instructions” may mean that the destination register of the two instructions also must be disjoint. Ie, you cannot dual issue an instruction like “A= B ? C+1 : D+1;” since both halves of the computation want to write into the A register. My logic is that if you allowed one register to be simultaneously written to from two different sets of SP cores, you’d need extra hardware logic to “select” the data from the two sets of incoming answers. And on top of this, you’d have to test two sets of predicates to mask the writes, not just one. That’s a lot more complex than allowing only one source of data with one predicate.

So I hypothesize that a KISS design of dual-issue would try to minimize the hardware complexity by disallowing dual issue of instructions that both write to the same register. If that hypothesis is correct, then the answer to juj_sca’s question whether dual-issue will reduce divergence is NO, at least for the common case of the same output register being used in both branches.

If we had only 1 dispatch unit per warp scheduler, then 4 schedulers could feed only 4x32 = 128 cores per cycle, which is less than 192 cores that are available.

This sentence is complicated to me, with realtion to your answers :


Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle

It says clearly… issue two instructions per warp… and I think that obviously execute those two instructions. So If the threads of a warp can execute only one instruction, how is possible dispatch and execute two instructions per warp in the kepler??

Please… i think to answer that question is the real problem…

The figure on page 10 of the GK110 whitepaper may help you visualize the dual-issue instructions a scheduler emits.

According to the figure… it dispatch 2 instructions per warp at the same time… So… how are distributed the threads of that warp?

To all active threads of the warp.

Please read http://en.wikipedia.org/wiki/Superscalar.

If a warp has two sequential instructions and the second instruction is not dependent on the first and the instructions are not issued to the same execution unit and all other conditions are met (different per architecture) then the warp scheduler can issue both instructions on same cycle. For example if you program had the instructions

ADD R0, R1, R2
LD R4, [R1+0]

The second instruction is not dependent on the first instruction. The first instruction is an ALU instruction and the second is a LSU instruction. There is a high likelihood that the warp scheduler can issue the instructions in the same cycle.

Please note that the Core i7 processor I’m sending this email on can issue 6 instructions from 1-2 threads per cycle per core.

The logic required for selecting which result to use is in fact identical to the logic required to prevent two instructions with the same destination register from executing is in fact identical.

In the one case, you have

if dest1 == dest2 then block issueport2

In the other you have

if (dest1,isport1write) == (dest2,isport2write) then block isport1write

Either of which compares two addresses and turns off a signal line if they’re the same. Note that the write pins have already been turned off if needed by the predicate.

Actually, the second one requires an additional bit in the comparitor, since you only want to block if there are actually two writes happening in the first place, but that’s pretty insignificant.

The only reason you’d want to consider blocking the instruction at the issue stage if the destinations are identical would be if the writeback stage was the bottleneck, and so adding logic there would force you to reduce clockspeed.