preventing ptxas from reordering instructions

Hi njuffa,

maybe it’s the lack of responsiveness that sometimes discourages people? Some reported bugs never get fixed.

In a conversation with @sjb3d, I thought of a nasty hack to prevent reordering:

static __device__ __forceinline__
void semanticFence()
{
#if __CUDA_ARCH__ >= 200
  asm("bar.arrive 15, 1536;"); // thread count must be > 0 (and a multiple of the warp size)
#endif
}

Explanation: each warp will prod the otherwise unused barrier #15 with 32 more threads.
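To make the intended use concrete, here is a minimal sketch (kernel and variable names invented for illustration; it assumes the fence above is in scope and that the launch configuration makes the second load legal):

```cuda
// Hypothetical usage sketch: the fence is placed between two loads so that
// ptxas will not hoist the second load above the preceding computation.
__global__ void example(const float* __restrict__ in, float* out)
{
    float a = in[threadIdx.x];                // first load
    float r = a * a;                          // work we want scheduled before the next load
    semanticFence();                          // ptxas treats the barrier as a scheduling fence
    float b = in[threadIdx.x + blockDim.x];   // should not be hoisted above the fence
    out[threadIdx.x] = r + b;
}
```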

I suspect this is nowhere near a free operation, but I can confirm that ptxas does treat it as a barrier and does not reorder code around it. This is no surprise.

What is unclear is if it is dangerous to use an unused barrier “slot” in this way.

Anyway, stay safe out there and don’t run with scissors. :)

Hyqneuron: In this case it’s entirely due to a lack of spare time on my side. I’m still eager to finish the report though, since this has been bothering me for so long.

Thanks for the PathScale reference. I’ll check their pricing and license. I had looked at their GitHub repo before, but with zero documentation, and with the toolchain producing only bare binary blobs, the learning curve seemed a bit too steep.
In general, though, using a toolchain that is not supported by Nvidia is usually a showstopper, so this could at best serve for some experiments.

Allanmac: Interesting idea. Maybe asm("bar.sync 15, 32;"); would do the trick.
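Wrapped up in the same style as above, that variant might look like this (untested sketch; whether letting each warp bar.sync only with itself on the otherwise unused barrier #15 is safe is exactly the open question in this thread):

```cuda
static __device__ __forceinline__
void semanticFenceSync()
{
#if __CUDA_ARCH__ >= 200
    // bar.sync with an explicit thread count of 32: each full warp
    // synchronizes only with itself on the otherwise unused barrier #15.
    asm("bar.sync 15, 32;");
#endif
}
```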

I am also troubled by this problem. To avoid it, I often have to refactor the code a lot. It would be great if NVIDIA could officially provide a pragma to prevent instruction reordering, e.g. asm(".pragma instruction_fence");
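For what it’s worth, nvcc’s inline assembly does accept a "memory" clobber, which works as a compiler-level fence. A sketch (function name invented), with the important caveat spelled out in the comment:

```cuda
// Sketch: a compiler-level fence using an empty asm with a memory clobber.
// This stops the nvcc front end from moving memory accesses across it,
// but it emits no PTX instruction at all, so ptxas itself can still
// reorder around this point -- which is exactly why a pragma that
// survives into ptxas would be so useful.
static __device__ __forceinline__
void compilerFence() // hypothetical name
{
    asm volatile("" ::: "memory");
}
```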