prevent compiler reorder dependent instructions

LongY · September 14, 2014, 6:15pm

Suppose I have two chains of dependent instructions, which is as follows:
repeat64(a+=a*b;)
repeat64(c+=c*d;)

When I look at the assembly code, the compiler generates interleaved instructions to
increase the instruction level parallelism rather than keep these two chains of instructions
separated. a, b, c, and d are in registers.

I was wondering if there is some way to prevent the compiler from reordering these instructions.

Thanks a lot for any input.

tera · September 14, 2014, 7:49pm

It’s possible as long as memory is involved:
preventing ptxas from reordering instructions - CUDA Programming and Performance - NVIDIA Developer Forums

I am not aware of a method preventing reordering of pure in-register calculations, although I would love to see one.

LongY · September 14, 2014, 8:46pm

Thanks, tera.
I had actually found that post on the forum. It deals with the case where memory is involved.
I am not sure whether it works for pure in-register instructions. The compiler reorders
these instructions and interleaves them to increase instruction-level parallelism.
I will continue to explore. Hopefully this issue can be fixed soon.

scottgray · September 15, 2014, 12:51pm

I’m curious as to why you’d want to keep them separate and not interleaved? Even if you have enough warps to fill the pipeline latencies, context switches are cheap but not free. I know Maxwell at least runs at a little over half speed under maximal switching. See my recent post about cuda-z:

So what's new about Maxwell? - CUDA Programming and Performance - NVIDIA Developer Forums

Also, you may want to check out my assembler for Maxwell, which lets you do whatever you want. It won’t help you if you’re still on Kepler, but the new performance cards are coming out soon. I’m still heavily tweaking the code but I’m pretty close now to documenting it all for general use. Sneak peek here:

Google Code Archive - Long-term storage for Google Code Project Hosting.

LongY · September 15, 2014, 4:16pm

Thanks, scott. The reason for keeping them separate is to see how instruction-level parallelism works on the GPU.
I found one solution that prevents this compiler optimization on Fermi: using a switch statement to force the compiler to continue executing the next instruction. In addition, -Xptxas -O0 needs to be added when compiling to turn off ptxas optimization; otherwise, the generated instructions are still interleaved.

allanmac · September 15, 2014, 4:48pm

@LongY, I’ve found that a simple __syncthreads() is a good “semantic fence” that will stop reordering.

But if you want to avoid the synchronization you can emit a dummy “bar.arrive”:

Here’s a snippet from an old test:

LongY · September 15, 2014, 5:38pm

@allanmac, thanks for pointing out the semantic fence idea. I just ran a quick test to see whether it works. It turns out that it probably only helps when memory instructions are involved: without writing a or c back to an array, those instructions are still interleaved.

allanmac · September 15, 2014, 5:49pm

Ah, I see…

[s]Thinking out loud, what about something like this as a semantic fence:

asm volatile ("mov.u32 %0, %%clock;" : "=r"(x) :: "memory");

(I haven’t tested it – sorry)[/s]

That doesn’t work.

And if that doesn’t work you could try generating the independent FMA sequences in PTX and mark them as volatile.

That doesn’t work either. :)

I give up! A workaround is to create a small dependency between the two sequences.

Otherwise, you can get a Maxwell GPU and kick ptxas to the curb with Scott’s assembler.
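
The "small dependency between the two sequences" workaround could be sketched roughly as below. This is only an illustration in plain C++ (the same body would sit inside a kernel in CUDA). The `repeat64` definition, the function name `run_chains_with_link`, and the `d += a * 0.0f` link are all assumptions, not something posted in the thread. The idea is that the second chain's input is made to depend on the final value of the first chain, so no instruction of the second chain can legally start before the first chain finishes, unless the compiler can prove the link away (fast-math style flags may let it do exactly that).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical definition of repeat64; the thread never shows the real one.
#define repeat4(x)  x x x x
#define repeat16(x) repeat4(x) repeat4(x) repeat4(x) repeat4(x)
#define repeat64(x) repeat16(x) repeat16(x) repeat16(x) repeat16(x)

struct ChainResult {
    float a;  // final value of the first chain
    float c;  // final value of the second chain
};

// Runs both chains, with the second chain's multiplier d made to depend
// on the result of the first chain.
inline ChainResult run_chains_with_link(float a, float b, float c, float d) {
    // First chain: every statement depends on the previous one.
    repeat64(a += a * b;)

    // The link below leaves d unchanged only if a is finite
    // (inf * 0 would be NaN), so check that assumption.
    assert(std::isfinite(a));

    // Link: adds zero to d, but d now carries a data dependency on a.
    d += a * 0.0f;

    // Second chain: cannot begin until d, and therefore a, is known.
    repeat64(c += c * d;)

    return {a, c};
}
```

For example, `run_chains_with_link(1.0f, 1.0f, 3.0f, 1.0f)` doubles `a` and `c` 64 times each. The price of this approach is one extra multiply and add on the critical path, and whether the generated SASS really stays separated still has to be checked in the disassembly.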

Uncle_Joe · September 16, 2014, 12:27am

I’m curious as to why you’d want to keep them separate and not interleaved?

I can’t say for CUDA, but even on regular processors, the compiler can be too greedy about instruction scheduling. The greedy solution of scheduling loads as early as possible isn’t always optimal because it can prevent dual issue (which requires 2 instructions that execute on separate units to be back to back), or better yet, quad issue!

A very recent example of bad scheduling is when I software pipelined a loop like this:

int a[N];
int sum = 0;
for (int i = 0; i < N; ++i)
  sum += a[i];

to

int sum = 0;
int prev = a[0], next;  // assumes N is odd, so that a[i + 1] below stays in bounds
for (int i = 1; i < N; i += 2)
{
  // each load and dependent add are separated by 2 independent instructions, greatly increasing ||ism
  next = a[i];
  sum += prev;
  prev = a[i + 1];
  sum += next;
}
sum += prev;

I was confused about why the pipelined version wasn’t faster than the reference. After seeing the assembly code, I thought the compiler (Intel C++ compiler 11) was being greedy, so I wrote an assembly version (with the same ordering as in the C++ source), and it was indeed 1.2x faster than the reference.
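
A self-contained sketch of the same transformation, for anyone who wants to try it, is given below. It is not the original test code: the function names `sum_reference` and `sum_pipelined` are made up for illustration, and a tail step is added so that the loop never reads past the end of the array when the length is even. Whether the compiler keeps this statement order in the generated code is exactly the open question of this post, so the assembly still has to be inspected.

```cpp
#include <cassert>
#include <cstddef>

// Reference loop: each iteration is one load feeding one dependent add.
inline int sum_reference(const int* a, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// Software-pipelined variant: in source order, every load is followed by
// two unrelated statements before the add that consumes it.
inline int sum_pipelined(const int* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    if (n == 0)
        return 0;

    int sum = 0;
    int prev = a[0];
    std::size_t i = 1;
    // Main loop handles two elements per iteration; a[i + 1] is in bounds
    // because of the i + 1 < n condition.
    for (; i + 1 < n; i += 2) {
        int next = a[i];
        sum += prev;
        prev = a[i + 1];
        sum += next;
    }
    // Tail: one element is left over when n is even.
    if (i < n) {
        sum += prev;
        prev = a[i];
    }
    sum += prev;  // drain the last pending load
    return sum;
}
```

Both functions return the same value for any length, so the reference can double as a correctness check while timing the pipelined version.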

Compiler experts, any comments on how schedulers avoid greed so that more instructions can dual or quad issue?

scottgray · September 16, 2014, 2:47pm

For the scheduler I wrote for maxas, I calculate what the prior instruction’s stall would be for every waiting instruction on the ready list. For instructions that can dual issue that would be zero, and hence such an instruction gets prioritized over one with 1 or more stalls. This is basically a 1 look ahead. I also use other heuristics like number of dependencies and instruction type mixing. All else being equal, I defer to the order in which an instruction appears in the assembly. This gives the author some degree of control over where loads occur in relation to other instructions.

For the code I’m writing this produces the near optimal ordering. The only way it might improve is with adding an additional instruction between memory operations as the memory units seem to operate more efficiently at half throughput (one memory instruction per two clocks). But I allow you to write code that isn’t scheduled, so for high performance sections (like inner loops) you have 100% control.
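
The ready-list priority described above could be sketched like this. It is a simplified illustration, not maxas code: the struct `ReadyInstr`, its fields, and the function `pick_next` are hypothetical names, and only two of the mentioned criteria are modelled (the stall the prior instruction would need, then source order as the tie-breaker). The dependency-count and type-mixing heuristics are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ready-list entry; the field names are illustrative only.
struct ReadyInstr {
    std::size_t source_order;  // position in the hand-written assembly
    int stall_after_prior;     // stall the prior instruction would need if
                               // this one is issued next (0 = can dual issue)
};

// Returns the index into `ready` of the instruction to issue next:
// smallest stall first, ties broken by the order in the source.
inline std::size_t pick_next(const std::vector<ReadyInstr>& ready) {
    assert(!ready.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < ready.size(); ++i) {
        const ReadyInstr& cand = ready[i];
        const ReadyInstr& cur = ready[best];
        bool fewer_stalls = cand.stall_after_prior < cur.stall_after_prior;
        bool same_stalls = cand.stall_after_prior == cur.stall_after_prior;
        if (fewer_stalls || (same_stalls && cand.source_order < cur.source_order))
            best = i;
    }
    return best;
}
```

With a ready list holding one instruction that needs a 2-cycle stall and two that can dual issue, `pick_next` returns whichever dual-issue candidate appears first in the source. That tie-break is what lets the author steer where loads land.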
