Isolate MAD & MOV: avoid interlacing of instruction types

Good Day,

This is my kernel:

1. move data in
2. do MAD instructions
3. move data out

Each step has many MOV or MAD instructions. My goal is to keep these instructions grouped together (i.e. prevent interlacing of MAD and MOV).

How?

I’ve tried 2 cases:

Case 1: Separate steps with 2 syncthreads() instructions

1. move data in
2. syncthreads()
3. do MAD instructions
4. syncthreads()
5. move data out

The result? The first sync is ignored! Thus steps (1) and (3) are interlaced, but step (5) is contiguous. (See the attached decuda code.)

Case 2: Separate steps with 1 syncthreads() instruction

1. move data in
2. syncthreads()
3. do MAD instructions
4. move data out

The result? Step (1) is contiguous, but steps (3) and (4) are interlaced. (See the attached decuda code.)
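The two cases above can be sketched as a single kernel with a compile-time switch for the number of barriers. This is only an illustration, not the original kernel: the names `madKernel`, `runMad`, `N` and `THREADS`, the data layout, and the input values are all assumptions. Note that `__syncthreads()` is a barrier, not a scheduling directive, so how far the compiler still reorders instructions around it depends on the compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 8;         // hypothetical: operand pairs per thread
constexpr int THREADS = 64;  // hypothetical block size

// SYNCS = 0: no barrier; 1: Case 2 (barrier after step 1);
// 2: Case 1 (barriers after steps 1 and 2).
template <int SYNCS>
__global__ void madKernel(const float *aIn, const float *bIn, float *cOut)
{
    __shared__ float sA[N * THREADS];
    __shared__ float sB[N * THREADS];
    __shared__ float sC[THREADS];
    const int t = threadIdx.x;

    // Setup (not one of the three steps): stage the inputs in shared memory.
    for (int i = 0; i < N; ++i) {
        sA[i * THREADS + t] = aIn[i * THREADS + t];
        sB[i * THREADS + t] = bIn[i * THREADS + t];
    }
    __syncthreads();

    // Step 1: move data in (shared memory -> registers).
    float a[N], b[N];
    #pragma unroll
    for (int i = 0; i < N; ++i) {
        a[i] = sA[i * THREADS + t];
        b[i] = sB[i * THREADS + t];
    }
    if (SYNCS >= 1) __syncthreads();

    // Step 2: MAD instructions, c = a * b + c.
    float c = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i) c = a[i] * b[i] + c;
    if (SYNCS >= 2) __syncthreads();

    // Step 3: move data out (registers -> shared memory, then global memory).
    sC[t] = c;
    cOut[t] = sC[t];
}

// Host helper (hypothetical): runs one block and returns thread 0's result.
template <int SYNCS>
float runMad()
{
    constexpr int n = N * THREADS;
    float hA[n], hB[n], hC[THREADS];
    for (int k = 0; k < n; ++k) { hA[k] = 1.0f; hB[k] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, THREADS * sizeof(float));
    cudaMemcpy(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice);

    madKernel<SYNCS><<<1, THREADS>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, THREADS * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return hC[0];
}

int main()
{
    printf("no sync: %f, case 2: %f, case 1: %f\n",
           runMad<0>(), runMad<1>(), runMad<2>());
    return 0;
}
```

All three variants compute the same result; the only way to see whether the barriers changed the instruction grouping is to disassemble each instantiation and compare.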

Do you know any tricks that might help?
Case2_decuda.txt (1.88 KB)
Case1_decuda.txt (1.92 KB)

Why would you want to separate the mads and the movs? Interlacing them can increase performance by preventing stalls.

Thanks for your reply! Generally the compiler does try to order the instructions for the best performance… but sometimes it doesn’t.

If I let the compiler do what it liked (i.e. no syncthreads()), I found I was only able to reach 66% of my expected GFLOPS, but with case 2 I got 91%. I thought the remaining percentage was due to the interlacing of the “data out” operation, but I’ve found this was not the case; it was just memory latency.

Thanks again, your question helped me figure it out!

That’s quite surprising, and would be good to investigate. Generally, adding a syncthreads should slightly decrease performance, particularly for larger block sizes, since it hurts latency hiding in the run-up to the sync (if half the threads in the block are already waiting at the sync, there are fewer threads left to switch to in order to hide latency). Even in the best case, the sync instruction takes up 4 clock cycles, which could have been used to perform a mad, for instance. The fact that you’re getting better performance with a syncthreads in there indicates that some compiler optimization is badly tuned, so more information would be helpful to everyone.
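One way to gather that kind of information is to time each kernel variant with CUDA events and convert the result to GFLOP/s. The following is a minimal sketch under stated assumptions: `madLoop`, `measureGflops`, `ITERS` and the launch sizes are hypothetical stand-ins, not the kernel discussed in this thread. Each MAD is counted as two floating-point operations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int ITERS = 1024;  // hypothetical: MADs per thread

// Hypothetical stand-in kernel: a chain of dependent MADs per thread.
__global__ void madLoop(float *out, float a, float b)
{
    float c = static_cast<float>(threadIdx.x);
    for (int i = 0; i < ITERS; ++i) c = a * c + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;  // keep the result live
}

// Times one launch with CUDA events and returns the measured GFLOP/s.
double measureGflops(int blocks, int threads)
{
    float *dOut;
    cudaMalloc(&dOut, sizeof(float) * blocks * threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    madLoop<<<blocks, threads>>>(dOut, 1.0001f, 0.5f);  // warm-up launch

    cudaEventRecord(start);
    madLoop<<<blocks, threads>>>(dOut, 1.0001f, 0.5f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);

    // 2 flops per MAD; flops / (ms * 1e6) gives GFLOP/s.
    const double flops = 2.0 * ITERS * blocks * threads;
    return flops / (ms * 1.0e6);
}

int main()
{
    printf("measured: %.1f GFLOP/s\n", measureGflops(64, 256));
    return 0;
}
```

Running the same harness on the kernel with and without the sync, and dividing by the device's theoretical peak, gives percentages comparable to the 66% and 91% quoted above.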

Well, it’s a little more complicated than just an additional syncthreads instruction.

I was doing MAD operations, c = a * b + c,
where ‘a’ and ‘b’ were first supposed to be transferred from smem to registers before performing the MAD.

But when you don’t use syncthreads, the compiler decides to transfer only ‘a’ and stream ‘b’ from smem. Thus the MAD operation looks like:

mad.rn.f32 $r2, s[$ofs2+0x0020], $r7, $r2

The problem is that MAD instructions that take an operand from smem need about 6 clock cycles to complete, instead of 4, which causes the reduction in performance.
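As a rough check of those figures, assuming every MAD in the kernel takes one operand from smem and nothing else limits throughput, the expected fraction of peak is simply the ratio of the two cycle counts:

```latex
\[
\frac{\text{MAD throughput with an smem operand}}
     {\text{MAD throughput with register operands}}
  = \frac{4\ \text{cycles}}{6\ \text{cycles}}
  \approx 0.67
\]
```

That is close to the 66% of expected performance reported earlier in the thread.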

This behaviour does not occur if you put a syncthreads after the instructions that move ‘a’ and ‘b’, as it forces both variables to be transferred to registers, ensuring that all operands come from registers only:

mad.rn.f32 $r25, $r28, $r8, $r25

Of course, when I’m transferring ‘c’ back to smem, the order of instructions doesn’t matter, which I didn’t realize until you replied to my post.