Suppose I have two chains of dependent instructions, as follows:
When I look at the generated assembly, the compiler has interleaved the two chains to
increase instruction-level parallelism rather than keeping them
separate. a, b, c, and d are all in registers.
Is there a way to prevent the compiler from reordering these instructions?
Thanks a lot for any input.
It’s possible as long as memory is involved:
I am not aware of a method preventing reordering of pure in-register calculations, although I would love to see one.
I actually found this post in the forum, but it only covers the case where memory is involved.
I am not sure whether it works for pure in-register instructions. The compiler reorders
these instructions so that they are interleaved, which increases instruction-level parallelism.
I will continue to explore. Hopefully this issue can be fixed soon.
I’m curious as to why you’d want to keep them separate and not interleaved? Even if you have enough warps to fill the pipeline latencies, context switches are cheap but not free. I know Maxwell at least runs at a little over half speed under maximal switching. See my recent post about cuda-z:
Also, you may want to check out my assembler for Maxwell, which lets you do whatever you want. It won’t help you if you’re still on Kepler, but the new performance cards are coming out soon. I’m still heavily tweaking the code, but I’m pretty close to documenting it all for general use. Sneak peek here:
Thanks, Scott. The reason for keeping them separate is to see how instruction-level parallelism works on the GPU.
I found one way to prevent this compiler optimization on Fermi: use a switch statement to force the compiler to continue executing the next instruction. Also, -Xptxas -O0 needs to be added when compiling to turn off ptxas optimization; otherwise the generated instructions are still interleaved.
@LongY, I’ve found that a simple __syncthreads() is a good “semantic fence” that will stop reordering.
But if you want to avoid the synchronization you can emit a dummy “bar.arrive”:
Here’s a snippet from an old test:
@allanmac, thanks for pointing out this semantic fence idea. I just ran a quick test to see whether it works. It turns out that it probably only helps when the instructions involve memory. Without writing a or c back to array values, the instructions are still interleaved.
Ah, I see…
Thinking out loud, what about something like this as a semantic fence:
asm volatile ("mov.u32 %0, %%clock;" : "=r"(x) :: "memory");
(I haven’t tested it, sorry)
That doesn’t work.
And if that doesn’t work you could try generating the independent FMA sequences in PTX and mark them as volatile.
That doesn’t work either. :)
I give up! A workaround is to create a small dependency between the two sequences.
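A minimal sketch of that workaround, assuming the two sequences are simple multiply-add chains. It is plain C++ so it can be compiled on the host; the same pattern would sit inside a kernel body. The function names and the `zero` parameter are hypothetical: `zero` is meant to be a value the compiler cannot see at compile time (for example a kernel argument) that is 0.0f at run time, so the extra term changes nothing numerically but makes the second chain wait for the first. This has not been checked against ptxas output.

```cpp
#include <cassert>

// Two independent chains: the compiler is free to interleave them.
void independent_chains(float& a, float b, float& c, float d, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) a = a * b + b;  // chain 1
    for (int i = 0; i < n; ++i) c = c * d + d;  // chain 2
}

// Same arithmetic, but chain 2 is seeded with the final value of chain 1.
// `zero` (hypothetical) must be 0.0f at run time and opaque at compile time.
void serialized_chains(float& a, float b, float& c, float d, int n, float zero) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) a = a * b + b;  // chain 1
    c += a * zero;                              // small dependency: c now needs a
    for (int i = 0; i < n; ++i) c = c * d + d;  // chain 2 cannot start earlier
}
```

With finite inputs and `zero == 0.0f`, both functions produce the same a and c. The cost is one extra multiply-add and the loss of any overlap between the two chains, which is the point of the experiment here.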
Otherwise, you can get a Maxwell GPU and kick ptxas to the curb with Scott’s assembler.
I’m curious as to why you’d want to keep them separate and not interleaved?
I can’t say for CUDA, but even on regular processors, the compiler can be too greedy about instruction scheduling. The greedy solution of scheduling loads as early as possible isn’t always optimal because it can prevent dual issue (which requires 2 instructions that execute on separate units to be back to back), or better yet, quad issue!
A very recent example of bad scheduling came up when I software-pipelined a loop like this:
// reference version
int sum = 0;
for (int i = 0; i < N; ++i)
    sum += a[i];
into this:
// software-pipelined version (assumes N is odd, so a[i + 1] stays in bounds)
int sum = 0;
int prev = a[0], next;
for (int i = 1; i < N; i += 2) {
    // each load and its dependent add are separated by 2 independent instructions, greatly increasing parallelism
    next = a[i];
    sum += prev;
    prev = a[i + 1];
    sum += next;
}
sum += prev;
I was confused about why the pipelined version wasn’t faster than the reference. After seeing the assembly code, I thought the compiler (Intel C++ compiler 11) was being greedy, so I wrote an assembly version (with the same ordering as in the C++), and it was indeed 1.2x faster than the reference.
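For anyone who wants to try the comparison, here is a self-contained sketch of the two loops as functions. The names `sum_reference` and `sum_pipelined` are made up for illustration, and the tail handling for even lengths is an addition, not part of the loop as originally written. Nothing here stops a compiler from rescheduling the loads, which is exactly the problem described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: each add depends directly on the load just before it.
int sum_reference(const std::vector<int>& a) {
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i];
    return sum;
}

// Software-pipelined: each load is written two statements ahead of its add.
int sum_pipelined(const std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n == 0) return 0;
    int sum = 0;
    int prev = a[0];
    std::size_t i = 1;
    for (; i + 1 < n; i += 2) {
        int next = a[i];   // load for the add two statements below
        sum += prev;       // consumes the load from the previous iteration
        prev = a[i + 1];   // load for the next iteration
        sum += next;
    }
    sum += prev;                      // drain the pipeline
    assert(i == n || i + 1 == n);     // at most one element is left over
    if (i < n) sum += a[i];           // leftover element when n is even
    return sum;
}
```

Both functions return the same value for any input length. Whether the second one is faster depends on whether the compiler keeps the written order.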
Compiler experts, any comments on how schedulers avoid greed so that more instructions can dual or quad issue?
For the scheduler I wrote for maxas, I calculate what the prior instruction’s stall would be for every waiting instruction on the ready list. For instructions that can dual issue, that would be zero, and hence they get prioritized over an instruction with 1 or more stalls. This is basically a one-instruction lookahead. I also use other heuristics like number of dependencies and instruction type mixing. All else being equal, I defer to the order in which an instruction appears in the assembly. This gives the author some degree of control over where loads occur in relation to other instructions.
For the code I’m writing, this produces a near-optimal ordering. The only way it might improve is by adding an additional instruction between memory operations, as the memory units seem to operate more efficiently at half throughput (one memory instruction per two clocks). But I allow you to write code that isn’t scheduled, so for high-performance sections (like inner loops) you have 100% control.
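To make the priority rule concrete, here is a rough C++ sketch of picking the next instruction from a ready list. It is not maxas code. The `Instr` fields, the toy stall model (0 if the candidate uses a different unit than the previous instruction, 1 otherwise), and the choice to prefer instructions with more dependents are all assumptions for illustration. The type-mixing heuristic is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical instruction record, for illustration only.
struct Instr {
    int order;       // position in the hand-written assembly
    int unit;        // execution unit id, e.g. 0 = ALU, 1 = memory
    int dependents;  // how many instructions wait on this result
};

// Toy stall model (an assumption): the previous instruction needs no stall
// if the candidate runs on a different unit, i.e. the pair could dual issue.
int stall_of_prev(const Instr& prev, const Instr& cand) {
    return prev.unit != cand.unit ? 0 : 1;
}

// One-instruction lookahead: lowest stall first, then more dependents,
// then the order in which the author wrote the instructions.
std::size_t pick_next(const Instr& prev, const std::vector<Instr>& ready) {
    assert(!ready.empty());
    auto key = [&](const Instr& x) {
        return std::make_tuple(stall_of_prev(prev, x), -x.dependents, x.order);
    };
    std::size_t best = 0;
    for (std::size_t i = 1; i < ready.size(); ++i) {
        if (key(ready[i]) < key(ready[best])) best = i;
    }
    return best;
}
```

Because source order is the last tie-breaker, two otherwise equal candidates are issued in the order the author wrote them, which is what gives the author control over load placement.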