preventing ptxas from reordering instructions

Sometimes ptxas and I disagree on the optimal ordering of instructions. In that case I usually resort to inserting

asm volatile("membar.cta;");

into the the code to prevent ptxas from reordering (memory involving) instructions around it. However it has a finite cost at runtime and doesn’t work for instructions only involving register operands.

Is there a similar PTX instruction that would completely compile away (apart from the reordering constraint) and also work on all kinds of instructions?

If not, may I request that this is included in a future version of the PTX instruction set?

On a related issue, is it possible to prevent ptxas from emitting moves right after texture fetches? In the case in question the data is not going to be used for hundreds of cycles, and I have dozens of spare registers available. Yet ptxas insists on putting moves right after the texture fetches, uselessly stalling execution for more than 500 cycles.

This has haunted me for years in cases where registers were scarce and the register layout was somewhat fixed due to a large number of texture instructions. In this case however I can see absolutely no reason for the movs.

I’m sure that designers of high performance shaders must have come across and solved this problem long before CUDA became en-vogue. Can anyone give me some pointers to the solution?

That sounds like an optimization bug in ptxas, and you should consider filing a bug report with an example that reproduces it. It shouldn’t leave useless moves in the program, especially if they cause scheduling hazards. You probably want to make sure that the move is actually useless first, there can be some tricky cases where moves look useless but actually serve a purpose (e.g. splitting live ranges).

Regarding your first question, instruction scheduling combined with register allocation are computationally intractable problems so you will always be able to find examples where the compiler doesn’t do the right thing. I can’t think of a way to force ptxas to only reorder some instructions, but not others (you effectively want a zero-cost barrier, maybe someone else can comment). You might try splitting the basic block (this is probably hard/unsupported in inline PTX) and using the ptxas option (–dont-merge-basicblocks).

I do appreciate the desire for more fine-grained control.

Thanks Gregory!

Yes, the zero-cost barrier is exactly what I want, I should probably have worded this more clearly. Something like asm volatile(“”) would provide on the PTX level, just extending all the way to shader assembly.

asm volatile(“membar.cta;”) did just what I want (at least for instructions involving memory) on compute capability 1.x and was very useful, it’s just not zero-cost anymore on newer architectures.

Regarding the useless movs it seems to me they stem from movs in the PTX output of cicc necessary to achieve SSA form that ptxas somehow is not capable of eliminating. I’ll try and see if I can remove them with a bit of handcrafted inline assembly.

No, it doesn’t seem to be related to SSA. Breaking SSA form using inline assembly didn’t change anything, at least regarding the texture cache (there seem to be loads of duplicated movs related to SSA, but they just bloat the code without slowing anything down).

It looks more like the fixed register layout of texture fetches is the problem. ptxas somehow sets up the registers for the next texture fetch immediately after the previous one using the same set of registers, instead of delaying it as long as possible. I.e. it seems like it wants to hide read-after-write register latency instead of texture cache / global memory latency.

The same happens with loads through the texture cache on sm_35: Immediately following the LDG.CT.128 the data is moved to different registers even though the original registers are not used again. What the **** is going on here?

I really need to extract a testcase and file a bug once I’ve got the time. But I’m unable to produce any nontrivial code that does not exhibit this problem. I can’t believe nobody else ever came across this problem, particularly within Nvidia?

I cannot say that I have encountered this curious code generation, it may be limited to specific contexts, or no longer present in the internal builds which I use most of the time. In general the compiler tries to schedule loads and first use of the load data far apart, even to the point where this starts to increase register pressure and reduce occupancy due to long live ranges. Is your code compiled with any register usage constraints? If so, what happens when you remove these constraints?

Thank you Norbert.

Removing the register limit does use a few more registers, but the MOVs stay right after the LDG.CT instructions.

I have to add that I’ve seen this behavior since at least CUDA 2.3 (probably since 2.1). I had come to think there must be a hardware limitation that would prevent hiding the latency anyway, so ptxas would have a reason not to care. Only yesterday I undertook the timing experiments to demonstrate that the hardware is capable of fully hiding the latency, but ptxas’ scheduling prevents it for any nontrivial code I’ve written.

Based on what you describe I would consider the submission of a bug report as very helpful. Thanks!

I’m working on the testcase right now. Thank you Norbert!

I very much agree with tera - I also need this kind of leverage on the code generation. So far, the best solution found is this:

if( clock() == 0 ) some_useless_variable++;

Here is a reproducer. We start with:

__global__ void reduce( float *dst, float *src )
    float a = src[0];
    float b = src[1];
    float c = src[2];
    float d = src[3];
    dst[0] = a+b+c+d;

which compiles into:

/*0008*/     /*0x10005de428004001*/ 	MOV R1, c [0x0] [0x44];
	/*0010*/     /*0x10001de428004005*/ 	MOV R0, c [0x0] [0x144];
	/*0018*/     /*0x10015c034800c000*/ 	IADD R5, R0, 0x4;
	/*0020*/     /*0x0050dc8580000000*/ 	LD R3, [R5];
	/*0028*/     /*0xf0511c8583ffffff*/ 	LD R4, [R5+-0x4];
	/*0030*/     /*0x10509c8580000000*/ 	LD R2, [R5+0x4];
	/*0038*/     /*0x0c40dc0050000000*/ 	FADD R3, R4, R3;
	/*0048*/     /*0x20501c8580000000*/ 	LD R0, [R5+0x8];
	/*0050*/     /*0x0830dc0050000000*/ 	FADD R3, R3, R2;
	/*0058*/     /*0x00009de428004005*/ 	MOV R2, c [0x0] [0x140];
	/*0060*/     /*0x00301c0050000000*/ 	FADD R0, R3, R0;
	/*0068*/     /*0x00201c8590000000*/ 	ST [R2], R0;
	/*0070*/     /*0x00001de780000000*/ 	EXIT;

Note that it interleaved adds and loads, which is not what I want. But if I use the device above:

global void reduce( float *dst, float *src )
float a = src[0];
float b = src[1];
float c = src[2];
float d = src[3];
if( clock() == 0 ) a+=1.f;
dst[0] = a+b+c+d;

I get:

/*0008*/     /*0x10005de428004001*/ 	MOV R1, c [0x0] [0x44];
	/*0010*/     /*0x10019de428004005*/ 	MOV R6, c [0x0] [0x144];
	/*0018*/     /*0x00001de428004005*/ 	MOV R0, c [0x0] [0x140];
	/*0020*/     /*0x00615c8580000000*/ 	LD R5, [R6];
	/*0028*/     /*0x10611c8580000000*/ 	LD R4, [R6+0x4];
	/*0030*/     /*0x2060dc8580000000*/ 	LD R3, [R6+0x8];
	/*0038*/     /*0x30609c8580000000*/ 	LD R2, [R6+0xc];
	/*0048*/     /*0x40019c042c000001*/ 	S2R R6, SR_ClockLo;
	/*0050*/     /*0x0051dc005000cfe0*/ 	FADD R7, R5, 0x3f800;
	/*0058*/     /*0x14715c23310c0000*/ 	ICMP.EQ R5, R7, R5, R6;
	/*0060*/     /*0x10511c0050000000*/ 	FADD R4, R5, R4;
	/*0068*/     /*0x0c40dc0050000000*/ 	FADD R3, R4, R3;
	/*0070*/     /*0x08309c0050000000*/ 	FADD R2, R3, R2;
	/*0078*/     /*0x00009c8590000000*/ 	ST [R0], R2;
	/*0088*/     /*0x00001de780000000*/ 	EXIT;

Which does the job. Cool, heh?

PS. Wasn’t here a “preview” button?

I notice it was still April 1st in California thirteen hours ago but not in my timezone. ;)

Vasily, have you timed membar.cta performance to back the claim that dealing that one in for three other instructions would be a gain?

It’s still good to know though that accessing %clock is treated specially by ptxas. I used to put membar.cta around clock() calls where I didn’t bother to (closely) inspect the code, just in case.

Regarding the useless movs I still need to extract the testcase but I’ve seen some pretty weird behavior where totally unrelated changes would make ptxas emit good code instead - unfortunately connected with big penalties elsewhere though. Hope to get to that soon.

P.S.: There is a"postview" aka edit button that sometimes even works. I notice though that the forum tends to have problems with multiple [code] sections in one post.

Hm… that’s an interesting thought. I used to think that memory fences are very expensive, but that might be not the case anymore. Especially if that’s a fence across a thread block, not a global fence. Thanks for the pointer!

Possibly you don’t need to use inline assembly - __threadfence_block() should suffice.

It is common when totally unrelated changes improve quality of the code generated. I’ve seen this with both nvcc and gcc, I guess you should always expect that from a compiler.

Thanks for the __threadfence_block() pointer!
I used to think of the inline assembly more as asm volatile(“”): with an instruction inserted to extend it’s effect to beyond ptxas stage as well. But __threadfence_block() might have fewer effects on the cicc stage so switching between the two gives you an extra degree of freedom to influence code generation. And if they turn out to have the same effect, at least __threadfence_block() looks less appalling to people reading the code.

I also used to think that membar.cta is almost free. At least it was on CC 1.x, where it just compiled away apart from the code ordering enforcement. On higher CCs it has to wait for the cache so it’s probably somewhere between shmem latency and the latency of __syncthreads() (but only for that warp, not for the entire block!). Might be a latency - throughput tradeoff where reading clock() has lower latency but membar.cta / __threadfence_block() might have higher throughput.

Yes, i absolutely expect tiny and unrelated changes to possibly have a big influence on the generated code. Still in my case it proves that it is possible to generate better code. For a long time I had not pursued this problem anymore because I suspected there might be an undocumented property of the hardware preventing a speedup anyway. But now that I see that it is possible to generate better code, the motivation for investing the time to extract a testcase has jumped up again.

I was able to get 2x speedup from a trivial compute-intensive loop (running in a single warp) by reordering instructions and reducing effective instruction count. But this kind of compiler inefficiency matters a lot less when you’ve got tens of warps to keep feeding the ALUs. Memory loads are different, though. Last time I meddled with the PTX instruction prefetch I was convinced that prefetch was a fake instruction.

It’s been a year and half, looks like things haven’t improved much :(

I think it would be helpful if bugs were filed for such cases. Having applied lots of manual optimizations as well as worked closely with the compiler team from the very start of CUDA, I will caution that the incorporation of manually constructed trandformations into the compiler is often harder than one thinks, due to the following two scenarios:

(1) The manual optimization violates the C/C++ semantics in some aspect, or uses information not available to the compiler (such as knowledge about operand ranges). That is to say, the transformation is valid in specific contexts, but not generally.

(2) The manual transformation is valid under C/C++ semantics, but tends to help performance in some cases, while negatively impacting it in others. Unless a good heuristic (fairly accurate and cheap to compute from information available at compile time; compile time must be kept reasonable) can be constructed for triggering the transformation, knowledge of the transformation is not helpful.

Hi Hyqneuron, thanks for joining in!

I’ve never been able to achieve good results with prefetch either. It has been much more effective to keep the value in a register until it is used. The compiler seems to actively work against this though by needlessly moving the prefetched value between registers, producing bogus dependencies on the way. I really have to finish my bug report on that one.

With regard to arithmetic throughput my recent experiment seem to indicate that register banking as mentioned by Sylvain Collange in my other thread is the most important factor.
Seems like a Kepler assembler is really needed. :) Do you know if any work has started on CC 3.5?
If you have a concrete area where I might be of help I might spare some time. Although overall I’m not sure if it’s a good investment of time as this will always yield unsupported binaries and the situation seems much more difficult with Kepler’s encoded dependencies.

I wonder what would be necessary to get access to shader assembly. NDAs? Money? Prospect of large GPU sales?
Or since it worked with decuda, I wonder if a fully functional askepler would to the trick?

I would strongly encourage CUDA users to file bugs against the compiler if there is solid evidence of performance shortcomings inin generated code cluding register allocation, instruction selection, and instruction scheduling. Optimizing compilers are a collection of numerous passes controlle by heuristics (some with conflicting goals), and it is always possible that some particular use case is poorly served by the complex interaction between the various components.

Filed bug reports and feature requests help drive compiler improvements and prioritization of the compiler work. Ready-to-use repro codes are very helpful in that process and are much appreciated.

Hi Tera

I know some researchers have used NV’s internal assembler in the past, but I’m not sure if NV ever gives it out for other reasons. Even PGI doesn’t seem to be using the assembler. If you’re rich, you can always buy the full assembler and documentation from PathScale… the stuff in their github repo is not quite enough

As for askepler… I’ll try and ask around to get some sponsorship. If no one shows much interest, we could always wait for the next seriously motivated teenage boy :)