Are predicate registers part of the 8192 registers of a multiprocessor or not?


I’m wondering whether the predicate registers needed for e.g. if statements are taken from the standard 8192 registers available per multiprocessor, or is there a special set of predicate registers?
If they are taken from the standard set, a lot of if clauses will spoil my register usage. Has anyone managed to use only one predicate register for all if clauses?


Yes and no

  • G80 has four predicate “condition code” registers per thread. These are separate from the normal ones.
  • If you run out of these four (for example, because of nested conditions), normal registers are used, and the contents are juggled between the condition code (cc) registers and the normal registers.


So you are saying that there are 768*4 extra predicate registers per multiprocessor!

If I have a problem with occupancy, do those predicate registers simply go unused?

AFAIK, register files do not support indexing or addressing, so it’s hard to coalesce them into regular registers, even for the driver. A predicate register is a single bit, so this wouldn’t provide a lot of extra storage either.
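To put numbers on the 768*4 figure and the occupancy question, here is a rough sketch in plain C++. It assumes the values quoted in this thread (8192 registers and at most 768 resident threads per multiprocessor, four predicate registers per thread) and deliberately ignores per-block allocation granularity, shared memory and every other occupancy limit. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Figures quoted in this thread for G80; treat them as assumptions.
constexpr int kRegistersPerMP      = 8192;  // general-purpose registers per multiprocessor
constexpr int kMaxThreadsPerMP     = 768;   // maximum resident threads per multiprocessor
constexpr int kPredicatesPerThread = 4;     // dedicated predicate (cc) registers per thread

// Upper bound on resident threads when only the register file is considered.
int residentThreads(int regsPerThread) {
    assert(regsPerThread > 0);
    return std::min(kMaxThreadsPerMP, kRegistersPerMP / regsPerThread);
}

// Dedicated predicate registers that belong to resident threads.
int predicateRegistersInUse(int regsPerThread) {
    return residentThreads(regsPerThread) * kPredicatesPerThread;
}
```

Under these assumptions a kernel that uses 10 registers per thread can reach the full 768 threads and all 3072 predicate registers, while a kernel that uses 40 registers per thread is limited to 204 threads and therefore 816 predicate registers. The rest sit idle, and since each one holds a single bit, little is lost.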

In a program of mine I see $p1 to $p39 in the compiled PTX file, which, as I understand it, means that the if clauses need 39 predicate registers.

So first question:
My program only needs about 40 registers. As said above, 4 predicate registers come with each thread and the rest are taken from the normal registers. So my kernel would then only use 5 registers (40 - 39 + 4) for its computations? That cannot be true!

Is there a way to reuse the predicate registers? Their results are never needed again later on.



In PTX there is no reuse of registers: PTX registers are virtual, and the actual allocation is performed in a later stage (conversion to the GPU-specific cubin). That is why you will also see a lot of other registers being used in PTX.
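To illustrate why 39 predicate names in PTX need not mean 39 physical predicate registers, here is a minimal sketch in plain C++. It is not how the real allocator works; it only counts how many predicates are live at the same time. `LiveRange`, `sequentialIfs` and `maxSimultaneouslyLive` are hypothetical names, and the instruction indices are invented.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical live range of one virtual predicate: it is written at
// instruction index `def` and read for the last time at `lastUse`.
struct LiveRange {
    int def;
    int lastUse;
};

// Models n independent if clauses: each predicate is set by one
// instruction and consumed by the branch that immediately follows.
std::vector<LiveRange> sequentialIfs(int n) {
    std::vector<LiveRange> ranges;
    for (int i = 0; i < n; ++i) {
        ranges.push_back({2 * i, 2 * i + 1});
    }
    return ranges;
}

// Returns the peak number of ranges that overlap, i.e. how many
// physical predicate registers are needed at the busiest point.
int maxSimultaneouslyLive(const std::vector<LiveRange>& ranges) {
    std::vector<std::pair<int, int>> events;  // (position, +1 or -1)
    for (const LiveRange& r : ranges) {
        assert(r.def <= r.lastUse);
        events.push_back({r.def, +1});          // becomes live
        events.push_back({r.lastUse + 1, -1});  // dead after its last use
    }
    // At equal positions the -1 events sort first, so a register freed
    // at one position can be reused by a definition at the same position.
    std::sort(events.begin(), events.end());
    int live = 0;
    int peak = 0;
    for (const auto& e : events) {
        live += e.second;
        peak = std::max(peak, live);
    }
    return peak;
}
```

With this model, `maxSimultaneouslyLive(sequentialIfs(39))` is 1, because each predicate dies before the next one is defined. Only conditions whose results overlap in time, such as nested ones, push the count up. To see the real per-thread register count after allocation, nvcc can print the ptxas statistics with `--ptxas-options=-v`.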

I was wondering, since a predicate register is 1 bit: if it is swapped into a normal register (because we use more than 4), does it use a whole 32-bit register for each predicate, or does it use one bit at a time in a single register (so one register would count for 32 predicates)?

Is a predicate swap slow?

Thank you.

IIRC, the compiler stores each predicate in a 16-bit half-register.

To be able to compact multiple predicates per register, you would need either to use a lot of predicates per thread, or use communication between threads through shared memory.
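For reference, packing several predicates into the bits of one 32-bit value by hand would look roughly like the sketch below. This is plain C++ with made-up helper names. It shows the one-bit-per-predicate idea from the question, not what the compiler does (per the reply above it uses a 16-bit half-register per predicate).

```cpp
#include <cassert>
#include <cstdint>

// Stores `value` in bit `slot` (0..31) of a packed predicate word
// and returns the updated word.
inline std::uint32_t setPredicate(std::uint32_t packed, unsigned slot, bool value) {
    assert(slot < 32);
    const std::uint32_t mask = std::uint32_t{1} << slot;
    return value ? (packed | mask) : (packed & ~mask);
}

// Reads the predicate stored in bit `slot` (0..31).
inline bool getPredicate(std::uint32_t packed, unsigned slot) {
    assert(slot < 32);
    return ((packed >> slot) & 1u) != 0;
}
```

Note that every write or read costs a shift plus a mask operation, compared with the single mov needed for an unpacked predicate, so packing trades instructions for register space.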

In the worst case, it just costs a mov between registers, i.e. one instruction. It shouldn’t be significant.

Thank you for your help. I googled this and looked through the docs for an hour or so and could not find the info. I will not be so afraid of overusing predicates now (except for watching the register usage). Thanks again.