What are the limits on predication?

I have a conditional that looks like

if (x != A[y]) {
    B[y] += z;
    A[y] = x;
}

  1. Can predication handle this, or is it too large?

  2. How can I tell if the compiler was able to apply predication?
    (I have very little knowledge of assembler, so if it involves looking at assembler please be specific.)

It is easy to manually remove the branch in this case, if this is what you want.

int tmp = (x != A[y]);              // 1 if the branch would be taken, else 0
B[y] = B[y] + tmp * z;              // adds z only when tmp == 1
A[y] = (1 - tmp) * A[y] + tmp * x;  // selects x when tmp == 1, keeps A[y] otherwise

You could take a look at this post https://devtalk.nvidia.com/default/topic/808438/cuda-programming-and-performance/predication-in-inline-ptx/
to see a quick example of predication in assembly.

The CUDA compiler is not necessarily going to use predication to avoid a branch. It may also choose to use a select-type instruction or a conditional move. My impression is that in recent years it has been favoring the latter approach a bit more.

The way to find out what the compiler chose to do is to disassemble the generated machine code (SASS) with cuobjdump --dump-sass. It may be a bit challenging at first to match up the relevant SASS with a particular portion of the source code.
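A minimal workflow might look like the following (the file name and target architecture are hypothetical; pick the architecture you actually run on):

```shell
# Compile for a concrete architecture so machine code is embedded in the object.
nvcc -arch=sm_70 -c kernel.cu -o kernel.o

# Disassemble the SASS. Predicated instructions show a leading @P0 / @!P0,
# and select-type instructions appear as SEL; a branch shows up as BRA.
cuobjdump --dump-sass kernel.o
```

If you see your conditional compiled to predicated instructions or SEL with no BRA nearby, if-conversion took place.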

There are limits to if-conversion, but they are not documented, are specific to the GPU architecture, and may also change between compiler versions. Any such compiler heuristic will be based on an internal representation rather than on the source code. By observation one can roughly guess that if-conversion takes place for up to three machine instructions for a simple if-statement, and up to five machine instructions for an if-then-else. But those are rough estimates only, and they can change based on context, the phase of the moon, etc.

In general, CUDA programmers should not worry about minor local branching and should write their code in a readable, natural style. In my experience, the need to manually remove branches by “clever” computation arises very rarely, and doing so makes the code more difficult to read, incurring technical debt.

Optimization efforts should be guided by the CUDA profiler, and in general you will probably find that data movement tends to be high on the list of bottlenecks, while branching is a minor concern.