How can I pack predicate registers into a regular register efficiently?

From reading the SASS code and the CUTLASS source code, my guess is that each regular 32-bit register can hold 4 * 4 = 16 predicate values. However, I cannot access P2R/R2P from CUDA C/C++ or from inline PTX.

When I use a bool array in C++, NVCC always seems to use one regular register to store one predicate register (something like R2P R0, 0x02). Is it possible to give the compiler some hints so that it packs 4 predicate registers into one byte? Thanks!
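
For reference, the fallback I can think of is to do the packing by hand in source: keep the flags as bits of one 32-bit integer instead of a bool array. Below is a minimal sketch of that idea. The helper names (`pack4`, `get_flag`, `set_flag`) and the `HD` macro are my own and not part of CUDA or CUTLASS, and I have not verified that NVCC lowers this to P2R/R2P rather than ordinary shift/logic instructions; it only guarantees that the flags occupy one register's worth of bits at the source level.

```cpp
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing in a host-only compiler,
// so the same helpers can be checked on the CPU and called from a kernel.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: pack four boolean conditions into the low 4 bits,
// with p0 in bit 0 and p3 in bit 3.
HD constexpr std::uint32_t pack4(bool p0, bool p1, bool p2, bool p3) {
    return  static_cast<std::uint32_t>(p0)
         | (static_cast<std::uint32_t>(p1) << 1)
         | (static_cast<std::uint32_t>(p2) << 2)
         | (static_cast<std::uint32_t>(p3) << 3);
}

// Hypothetical helper: read bit `idx` (0..31) of `flags` back as a bool.
HD constexpr bool get_flag(std::uint32_t flags, unsigned idx) {
    return ((flags >> idx) & 1u) != 0u;
}

// Hypothetical helper: return `flags` with bit `idx` (0..31) set to `value`.
HD constexpr std::uint32_t set_flag(std::uint32_t flags, unsigned idx, bool value) {
    return (flags & ~(1u << idx)) | (static_cast<std::uint32_t>(value) << idx);
}
```

In a kernel this would be used as something like `uint32_t guards = pack4(r0 < m, r1 < m, r2 < m, r3 < m);` followed later by `if (get_flag(guards, 2)) { ... }`, where `r0..r3` and `m` are placeholder names. Whether the resulting SASS is actually better than the bool-array version would need to be checked with `cuobjdump -sass`.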