Hi all,
I have one question. How many predicate registers (e.g., P0, P1) does a thread have?
Will the use of predicate registers add pressure to thread registers (R[0-255])?
Will the number of predicate registers I use influence occupancy?
Thanks!
(1) I believe I have seen up to six predicates being used. The number available may vary by GPU architecture; I am not aware of official documentation that specifies the number of predicates.
(2) I believe I have seen the compiler spill predicates into general-purpose registers when it runs out of predicates. Look for the instructions P2R and R2P, although my memory is hazy on this and I could be wrong.
(3) If predicates are spilling into R-registers, occupancy may be affected by the increased R-register usage.
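To make point (3) concrete, here is a rough sketch of how per-thread R-register usage can cap the number of resident warps on an SM. The limits in `SmLimits` (65536 registers per SM, 64 resident warps, allocation in chunks of 256 registers per warp) are assumptions for illustration and are not stated in this thread; real values depend on the architecture, and the official occupancy calculator applies additional rules that this simplified model ignores. The names `SmLimits` and `warps_limited_by_registers` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits. These are assumed values for illustration;
// check the documentation for your GPU architecture.
struct SmLimits {
    int regs_per_sm = 65536;  // assumed number of 32-bit registers per SM
    int max_warps   = 64;     // assumed maximum resident warps per SM
    int alloc_unit  = 256;    // assumed per-warp register allocation granularity
};

// Simplified model: upper bound on resident warps per SM imposed by
// register usage alone (shared memory and block limits are ignored).
inline int warps_limited_by_registers(int regs_per_thread, const SmLimits& sm = SmLimits{}) {
    assert(regs_per_thread > 0 && regs_per_thread <= 255);
    const int warp_size = 32;
    int regs_per_warp = regs_per_thread * warp_size;
    // Round up to the next multiple of the allocation unit.
    regs_per_warp = (regs_per_warp + sm.alloc_unit - 1) / sm.alloc_unit * sm.alloc_unit;
    return std::min(sm.max_warps, sm.regs_per_sm / regs_per_warp);
}
```

Under these assumed limits, a kernel at 32 registers per thread is not register-limited, while a single extra register (for example, one holding spilled predicates) pushes it to a lower bound. Whether that matters in practice is something only the profiler can tell you.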
Based on my experience, your questions focus on a very unlikely (i.e. largely hypothetical) corner case. Use of predicate registers is not typically something CUDA programmers should be worried about. Instead, use the CUDA profiler to identify actual bottlenecks.
Thanks! Your answer is really helpful and I also observed (1) & (2).
I’m doing research related to CUDA (e.g., pushing GPU performance to its limit and finding new applications that are suitable for GPGPU), so my questions may seem weird. I hope you can understand.
I think there are 7 predicates per thread, P0~P6. In the SASS ISA, predication is encoded with 4 bits: 3 bits specify the predicate register (000~110, with 111 meaning no predication), and 1 bit is for the optional NOT operation on the predicate (such as @!P0).
Presumably the 111 encoding is actually the predefined predicate “True”, which shows up as “PT” in SASS disassembly. That is a real predicate, as one can see from those instructions that support predicate logic (typically used in the implementation of compound branch conditions).
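As an illustration of the 4-bit guard encoding discussed above, here is a small decoding sketch. It assumes the negation flag is the most significant of the 4 bits and the low 3 bits are the register index, with index 7 printed as PT; the exact bit positions within a real instruction word vary by architecture and are not specified in this thread. The names `PredicateGuard`, `decode_guard`, and `guard_to_string` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decoded form of a 4-bit predicate guard field.
struct PredicateGuard {
    bool negated;    // true for guards written as @!Pn
    unsigned index;  // 0..6 for P0..P6, 7 for the always-true predicate PT
};

// Assumed layout: bit 3 is the negation flag, bits 2..0 are the register index.
inline PredicateGuard decode_guard(std::uint8_t field) {
    assert(field < 16);  // only 4 bits are meaningful
    return PredicateGuard{(field & 0x8) != 0, field & 0x7u};
}

// Render the guard the way it appears in SASS disassembly, e.g. "@!P0".
inline std::string guard_to_string(const PredicateGuard& g) {
    const std::string name =
        (g.index == 7) ? std::string("PT") : "P" + std::to_string(g.index);
    return std::string("@") + (g.negated ? "!" : "") + name;
}
```

With this layout, `guard_to_string(decode_guard(0x8))` yields `@!P0` and `guard_to_string(decode_guard(0x7))` yields `@PT`, which is consistent with 7 usable predicates (P0~P6) plus PT.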
Thanks for pointing this out.
Maybe the way predicate registers are indexed depends on the architecture.
You can check Section 2.3 of this paper:
https://arxiv.org/pdf/1804.06826.pdf
It says:
"Predication is regulated by 4 bits: the first bit is a negation flag, and the remaining 3 bits encode a predicate register index."
You can use nvdisasm -plr (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#nvdisasm-usage) to see predicate register usage.
That’s cool!
And I think I need a wider monitor