Texture fetch and predication

I have one device function, which looks like this:

__device__ float getValue(float inVal) {
     float outVal = inVal;
     if (someCondition)  // in the real code: address calculation + texture fetch (names simplified)
          outVal = tex1Dfetch(myTex, someIndex);
     return outVal;
}

For a given execution, I know for a fact that someCondition is false, as it comes straight from an input file. It is false all the time, for all threads.

I profile the code as-is (using the Nsight memory experiment) and then comment out the condition (so that the texture fetch does not get compiled). The profiler tells me that the texture load size is not equal when comparing both executions (by a significant margin), meaning that commenting out a texture fetch that is never reached has an impact on the global bandwidth requirement. The overall application (which is of course much, much bigger than this snippet) also gets slower/faster accordingly.

I understand that this is most likely due to predication, since the ‘if’ really is that small in the real code (address calculation + fetch), but I’m wondering if there is anything to be done about it.

Edit: CUDA 4.2, Tesla C2070, 310.90 driver, Windows 7 64-bit

As far as I know, texture fetch instructions can be predicated on sm_2x and later GPUs. You can easily check your hypothesis by inspecting the generated SASS (machine code), obtained by disassembling the executable with cuobjdump --dump-sass. Tiny branch bodies like this one are usually predicated, and I have seen predicated texture fetch instructions in real-life code. Predication lengthens basic blocks and thus gives the instruction scheduler more freedom to re-arrange instructions for best performance.
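For example, assuming your executable is named `app` (name hypothetical), you could dump the SASS and look for predicated texture fetches like this; in SASS output, a leading `@P0` / `@!P0` on an instruction marks it as predicated:

```shell
# Disassemble the embedded SASS (requires the CUDA toolkit on PATH);
# the guards let the commands degrade gracefully if cuobjdump/app are absent
cuobjdump --dump-sass app > app.sass 2>/dev/null || true
# List texture fetch instructions that carry a predicate
grep -E '@!?P[0-9]+ +TEX' app.sass 2>/dev/null || true
```

If the TEX instructions show up with predicates (rather than sitting behind a plain BRA), the fetch is indeed predicated rather than branched around.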

I am a bit surprised by your observations. If I understand correctly, your hypothesis is that the texture loads actually happen even for those texture fetch instructions that are predicated off? An alternative hypothesis would be that the profiler counters do not properly account for the fact that some loads are predicated off, and that the speedup observed from manual code removal is due to the reduced dynamic instruction count. I can’t think of a quick way of testing these hypotheses.

There is a uniform branch instruction, shown as BRA.U in the disassembly, that can be advantageously combined with predication: it skips the predicated section of code outright when the branch condition is uniform across the warp, as is the case in your example. From looking at a lot of generated code, it seems the compiler uses a heuristic based on the size of the branch body and emits BRA.U when the number of predicated instructions exceeds a certain limit (around 5+ instructions), since BRA.U itself also has a cost.

For code where I knew that BRA.U would (almost) never help, I managed to eliminate it by splitting one if-statement into multiple ones. Note that this approach exploits compiler artifacts that could change at the drop of a hat, so it is not something I recommend programmers do. The strategy for going the other way, i.e. enticing the compiler to combine BRA.U with predication, would be to grow the size of the branch body.
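A sketch of that if-splitting trick, using CUDA 4.x-style texture references and hypothetical names (texA, texB, accumulate); each tiny if-body stays within the range the compiler appears willing to predicate without a guarding BRA.U:

```cuda
texture<float, 1> texA, texB;  // hypothetical texture references

__device__ float accumulate(bool cond, int i, float c)
{
    float a = 0.0f, b = 0.0f;
    // One large if-body may trip the BRA.U size heuristic; several tiny
    // if-statements keep each body small enough to be predicated instead.
    // This relies on a compiler artifact and may change between toolkits.
    if (cond) a = tex1Dfetch(texA, i);
    if (cond) b = tex1Dfetch(texB, i);
    if (cond) c = a * b + c;
    return c;
}
```

Whether the compiler actually predicates each fragment should be confirmed in the SASS for the specific toolkit version.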

The uniform branch is exposed at the PTX level as bra.uni, so the best local strategy I can think of is to code the entire if-statement as inline PTX assembly, using a uniform branch plus predication. Depending on the nature of the branch condition, a higher-level strategy would be to template your kernel and then implement multiple specializations, some with the texture fetches and some without.
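A minimal sketch of the templating approach (all names hypothetical): the condition read from the input file selects, once on the host, between two compile-time specializations, so the no-fetch instantiation contains no texture instructions at all:

```cuda
texture<float, 1> myTex;  // hypothetical texture reference

template <bool UseTexture>
__global__ void process(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float val = in[i];
    if (UseTexture)                  // compile-time constant: the dead
        val = tex1Dfetch(myTex, i);  // branch is eliminated entirely
    out[i] = val;
}

// Host side: pick the specialization once, based on the input file
void launchProcess(float *out, const float *in, int n, bool useTexture)
{
    dim3 block(256), grid((n + block.x - 1) / block.x);
    if (useTexture)
        process<true><<<grid, block>>>(out, in, n);
    else
        process<false><<<grid, block>>>(out, in, n);
}
```

The `UseTexture == false` instantiation should then contain no TEX instructions at all, which can be verified with cuobjdump --dump-sass.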