Thanks for your reply! Generally the compiler does try to optimize the instructions to obtain the best performance… but sometimes it doesn’t.
If I let the compiler do what it liked (i.e. no __syncthreads()), I found I was only able to reach 66% of my expected GFLOPS, but when I did case 2 I got 91%. I thought the remaining percentage was due to the interlacing of the “data out” operation, but I’ve found this was not the case; it was just memory latency.
Thanks again, your question helped me figure it out!
That’s quite surprising, and would be good to investigate. Generally, adding a syncthreads should slightly decrease performance, particularly for larger block sizes, since it hurts latency hiding in the run-up to the sync (if half the threads in the block are already waiting at the sync, there are fewer threads to switch to in order to hide latency). Even in the best case, the sync instruction would take up 4 clock cycles, which could otherwise be used to perform a MAD, for instance. The fact that you’re getting better performance with a syncthreads in there indicates that some compiler optimization is badly tuned, so more information would be helpful to everyone.
Well, it’s a little more complicated than just an additional syncthreads instruction.
I was doing MAD operations of the form c = a * b + c,
where ‘a’ and ‘b’ were first supposed to be transferred from smem to registers, and the MAD performed afterwards.
But when you don’t use syncthreads, the compiler decides to transfer only ‘a’ and stream ‘b’ from smem, so the MAD operation looks like:
mad.rn.f32 $r2, s[$ofs2+0x0020], $r7, $r2
The problem is that MAD instructions that use operands from smem take about 6 clock cycles to complete instead of 4, which causes the reduction in performance.
This behaviour is not exhibited if you put a syncthreads after the instructions that move ‘a’ and ‘b’, as it forces both variables to be transferred to registers, ensuring all operands come from registers only:
mad.rn.f32 $r25, $r28, $r8, $r25
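For anyone wanting to try this, here is a minimal sketch of the source-level pattern described above: both operands are copied from shared memory into locals, a barrier follows the two moves, and only then is the MAD performed. The kernel name `mad_regs`, the block size `N`, and the indexing are hypothetical and not taken from the original kernel. Whether the compiler really keeps both operands in registers depends on the compiler version and target architecture, so treat this as something to experiment with rather than a guaranteed fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 32  // hypothetical block size, not taken from the post

// Hypothetical kernel: c = a * b + c, with both operands copied to locals first.
__global__ void mad_regs(const float *a_in, const float *b_in, float *c_io)
{
    __shared__ float a_s[N];
    __shared__ float b_s[N];

    int t = threadIdx.x;
    a_s[t] = a_in[t];
    b_s[t] = b_in[t];
    __syncthreads();              // shared tiles are complete before any thread reads them

    float a = a_s[t];             // move 'a' from smem into a local
    float b = b_s[(t + 1) % N];   // move 'b' from smem into a local (written by another thread)
    __syncthreads();              // barrier after the two moves, before the MAD

    float c = c_io[t];
    c = a * b + c;                // the MAD
    c_io[t] = c;
}

int main()
{
    float h_a[N], h_b[N], h_c[N], h_ref[N];
    for (int i = 0; i < N; ++i) { h_a[i] = float(i); h_b[i] = 2.0f; h_c[i] = 1.0f; }
    // Host reference: same indexing as the kernel.
    for (int i = 0; i < N; ++i) h_ref[i] = h_a[i] * h_b[(i + 1) % N] + h_c[i];

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_c, sizeof(h_c));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c, sizeof(h_c), cudaMemcpyHostToDevice);

    mad_regs<<<1, N>>>(d_a, d_b, d_c);
    cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);

    // Count elements that differ from the host reference.
    int bad = 0;
    for (int i = 0; i < N; ++i)
        if (h_c[i] != h_ref[i]) ++bad;
    printf("%d mismatches\n", bad);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return bad != 0;
}
```

The only way to know what the compiler actually did is to look at the generated instructions (for example with `cuobjdump -sass` on the compiled binary) and check whether the MAD still takes an `s[...]` operand.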
Of course, when I’m transferring ‘c’ back to smem, the order of instructions doesn’t matter, which I didn’t realize until you replied to my post.
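As a rough sanity check on the numbers in this thread: assuming the kernel is limited by MAD issue rate and that every MAD reads one operand from smem (both are simplifying assumptions, and the cycle counts are the approximate figures quoted above), the expected fraction of peak throughput would be

```latex
\frac{t_{\text{reg}}}{t_{\text{smem}}} = \frac{4\ \text{cycles}}{6\ \text{cycles}} \approx 0.67
```

which is in line with the 66% measured without the syncthreads.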