Is there a way to control instruction ordering? (And what is the difference between TEX.T and TEX.P?)

I’ve been using cuobjdump -sass to look at the disassembly of a kernel. The kernel contains a lot of unrolled loops and tex2D instructions. The compiler tends to use a lot of registers for this kernel and looking at the disassembly it seems to be doing this primarily so that it can execute the tex2D instructions in groups of four (three TEX.T followed by one TEX.P). Additionally it tries to leave several of these groups pending at each TEXDEPBAR instruction. If I use launch bounds to reduce the register usage (but not to the point of spilling registers) then it tends to become a mess with tex2D instructions executed individually or in groups of two or three (and also not in the same order as in the original C or PTX). This kernel runs slower despite having higher occupancy. Is there something special about groups of four tex2D instructions (is it required in order to achieve the quoted texturing rate for example?) and what exactly is the difference between TEX.T and TEX.P anyway? I’m pretty sure that if I could manually allocate physical registers then I could do a better job of scheduling these tex2D instructions. I looked at asfermi but apparently it doesn’t work properly on Kepler because it can’t generate the necessary scheduling information. Is there anything I can do at the C or PTX level to guide the compiler/assembler instead?

What you observe is called “load batching”. This is a machine-specific optimization that the CUDA compiler applies to improve the performance of memory intensive code. As far as I know, the load batching process tries to maximize the utilization of internal finite-length queues, whose sizes differ between different GPU architectures.

As you have found, there is a trade-off with register pressure, in that increasing the batch size requires more temporary register storage. The compiler uses heuristics to determine the “optimal” balance between batch size and register pressure (and thus occupancy). You have also already found that you can use __launch_bounds__() to shift the balance in the direction of reduced register usage. Unless the increased occupancy can make up for the loss of efficiency inside the memory hierarchy, this will result in reduced performance, so what you are seeing is typical for memory intensive code that already has reasonable occupancy.
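For readers who have not used the qualifier, here is a minimal sketch of how __launch_bounds__() is attached to a kernel. The kernel name, the bounds (256 threads per block, 4 blocks per multiprocessor) and the use of a texture object are assumptions made for illustration; they are not taken from the kernel discussed in this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical bounds: the kernel is never launched with more than 256
// threads per block, and the compiler is asked to keep register usage low
// enough that at least 4 blocks can be resident per multiprocessor.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM     4

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
sampleKernel(cudaTextureObject_t tex, float *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < width) {
        // One fetch per thread; a real kernel would issue many of these.
        out[x] = tex2D<float>(tex, x + 0.5f, 0.5f);
    }
}
```

Compiling with nvcc -Xptxas -v prints the register count per thread, which makes it easy to see how far a given pair of bounds actually moves the balance.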

You can influence loop unrolling with #pragma unroll immediately prior to a loop. In my experience, the tool chain in recent versions of CUDA makes good decisions about loop unrolling, so that I have not tried to interfere manually in the past couple of years.
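A short sketch of the three common forms of the pragma follows. The kernel, the trip count of 8 and the texture object are hypothetical and only there to give the loops something to do.

```cuda
#include <cuda_runtime.h>

#define ROWS 8  // hypothetical compile-time trip count

__global__ void sumRows(cudaTextureObject_t tex, float *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= width) return;

    float u = x + 0.5f;
    float acc = 0.0f;

    // Request full unrolling (possible because ROWS is known at compile time).
    #pragma unroll
    for (int row = 0; row < ROWS; ++row)
        acc += tex2D<float>(tex, u, row + 0.5f);

    // Request unrolling by a factor of 4.
    #pragma unroll 4
    for (int row = 0; row < ROWS; ++row)
        acc += tex2D<float>(tex, u, row + 0.5f);

    // Request that the loop not be unrolled at all.
    #pragma unroll 1
    for (int row = 0; row < ROWS; ++row)
        acc += tex2D<float>(tex, u, row + 0.5f);

    out[x] = acc;
}
```

The pragma applies only to the loop that immediately follows it, and the compiler still decides how to schedule the instructions inside the unrolled body.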

There is probably not enough detail provided by the public architecture description to achieve better performance by manual optimization at the SASS (machine code) level, even if a SASS assembler were available for individual GPU architectures. Note that SASS is architecture specific and not portable.

Thanks, njuffa.

I think that I might be able to achieve the effect I want by using #pragma unroll 1 and manually batching the tex2D instructions within the loop iteration. The problem with this is that I will then be wasting registers on the loop counter and addresses, etc. derived from it. Also, I’m currently using different kernel parameters in each loop iteration and I’m not sure if I can do this without incurring an overhead if my loop is not fully unrolled.

Basically, I want the loop to be unrolled in the sense that it has a fixed number of iterations and has values which can be determined at compile-time but I still want the implicit register re-use and ordering of instructions that comes with a loop.
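To make the idea concrete, here is a rough sketch of a rolled loop with the fetches batched by hand inside each iteration. Everything in it is an assumption for illustration: the TexParams struct, NUM_TEXTURES, the four-fetch pattern and the kernel name are hypothetical, and the per-iteration kernel parameters are modelled as an array indexed by the loop counter.

```cuda
#include <cuda_runtime.h>

#define NUM_TEXTURES 4  // hypothetical number of loop iterations

// Hypothetical parameter block: one texture object per loop iteration.
struct TexParams {
    cudaTextureObject_t tex[NUM_TEXTURES];
};

__global__ void batchedFetch(TexParams p, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float u = x + 0.5f;
    float v = y + 0.5f;
    float acc = 0.0f;

    // Keep the loop rolled so that registers are reused across iterations.
    #pragma unroll 1
    for (int n = 0; n < NUM_TEXTURES; ++n) {
        // Indexed read of a kernel parameter; this costs a little compared
        // with a fully unrolled loop, where the index is a constant.
        cudaTextureObject_t t = p.tex[n];

        // Issue four independent fetches before consuming any result,
        // so the compiler has the option of keeping them in flight together.
        float a = tex2D<float>(t, u,        v);
        float b = tex2D<float>(t, u + 1.0f, v);
        float c = tex2D<float>(t, u,        v + 1.0f);
        float d = tex2D<float>(t, u + 1.0f, v + 1.0f);

        acc += (a + b) + (c + d);
    }

    out[y * width + x] = acc;
}
```

Source order is only a hint here: the compiler remains free to reorder or split the fetches, so the SASS still needs to be checked with cuobjdump.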

I think the thing that concerns me most is the fact that my tex2D() instructions are now in the wrong order. The original order was carefully chosen to maximize cache hit rates.

I am not sure how you determined that TEX instructions are issued in “wrong” order by the compiler and that a different ordering of these instructions would improve overall performance of the code.

If you have repro code that can demonstrate this, I would suggest filing a bug against the compiler.

Well, it’s difficult to prove that a different ordering would improve the performance without being able to choose a specific ordering. However, the original idea was for iteration N of the loop to make M calls to tex2D() using texture N. I now have calls from iteration N+1 (operating on texture N+1) literally interleaved in some cases with calls from iteration N (operating on texture N). This doesn’t seem like a good idea. Of course there is more to the story because it’s possible that other warps will already have progressed to iteration N+1 anyway but that is something I definitely can’t control (without using __syncthreads() anyway).

Okay, so I’ve added a __syncthreads() call guarded by a condition on sync to the end of my loop (sync is a kernel parameter which happens to always be 0).

This seems to successfully prevent the compiler from mixing up calls from different iterations. I’d be happier if there was some kind of #pragma that achieved the same thing though.
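A guess at the shape of this trick, based only on the description in this thread (a __syncthreads() behind a condition on a kernel parameter that is always 0), might look like the sketch below. The kernel name, TexParams, NUM_TEXTURES and the fetch pattern are hypothetical.

```cuda
#include <cuda_runtime.h>

#define NUM_TEXTURES 4  // hypothetical number of loop iterations

// Hypothetical parameter block: one texture object per loop iteration.
struct TexParams {
    cudaTextureObject_t tex[NUM_TEXTURES];
};

// 'sync' is always passed as 0 by the host, but the compiler cannot know that.
__global__ void fencedFetch(TexParams p, float *out,
                            int width, int height, int sync)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // No early return: every thread reaches the (never taken) barrier.
    bool inside = (x < width) && (y < height);

    float u = x + 0.5f;
    float v = y + 0.5f;
    float acc = 0.0f;

    #pragma unroll
    for (int n = 0; n < NUM_TEXTURES; ++n) {
        float a = tex2D<float>(p.tex[n], u,        v);
        float b = tex2D<float>(p.tex[n], u + 1.0f, v);
        acc += a + b;

        // Never executed at run time, but it sits between the fetches of
        // iteration n and those of iteration n+1 in the unrolled code.
        if (sync) __syncthreads();
    }

    if (inside) out[y * width + x] = acc;
}
```

Whether the barrier really stops fetches from moving across iterations is not something the language guarantees; it is the behaviour reported here, and it should be confirmed in the disassembly for each compiler version.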

This looks very familiar to me - have you seen my thread from last year?
Unfortunately I have no solution to offer, as I have been working on unrelated projects since and not had time to pursue the issue further. I’d really like to help with getting this sorted though.

Actually, I think maybe I did read that thread and then forgot about it…

I’d be interested to hear what people think is the lowest cost way to implement a barrier of this type. In my solution the condition gets evaluated just once and the result just sits there in a predicate register for the whole kernel. Since it never actually executes the __syncthreads() I think this must be pretty cheap. If there is an advantage to using something else (maybe an asm of some kind) instead of the __syncthreads() then I’d like to know.

The end result, by the way, after I rearranged the contents of my loop to make sure that I had as many tex2D() calls as possible pending when it hits the barrier was about an 8% improvement over the compiler’s best efforts.

I’d think your barrier is pretty optimal if you can spare one condition register for the entire kernel.
(Unless of course there is an undocumented PTX barrier pseudoinstruction that compiles to zero instructions).