Uint8 mask to bitmask

Are there any native CUDA primitives to efficiently convert a uint8 mask (where only the least significant bit of each byte indicates true or false) into a bitmask, without writing a loop?

input: a[32] of type uint8
output: b[4] of type uint32

I checked the CUDA Math API section of the CUDA Toolkit documentation but didn't find any.
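For what it's worth, the closest thing to a native primitive I'm aware of is the warp vote intrinsic `__ballot_sync`, which collapses one predicate per lane into a dense 32-bit mask. Whether it fits depends on the output layout you actually want (your `b[4]` suggests the 32 result bits are not densely packed into one word). A sketch, assuming one warp handles the 32 bytes:

```cuda
__global__ void pack_with_ballot(const uint8_t *a, uint32_t *out)
{
    int lane = threadIdx.x & 31;
    // Each lane tests the LSB of one input byte; __ballot_sync gathers
    // the 32 predicates into a single 32-bit mask, with bit 'lane' set
    // iff that lane's predicate is nonzero.
    uint32_t mask = __ballot_sync(0xFFFFFFFFu, a[lane] & 1);
    if (lane == 0) {
        *out = mask;
    }
}
```

This produces one bit per input byte, densely packed, which may or may not be the layout intended by the question.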

Maybe I am particularly dense today, but it is not clear what the desired operation does. So the input comprises 32 bytes a[], each of which contains a boolean flag in a[i]<0>, i=0, …, 31. And these 32 bits are to be deposited nibble-wise in the 128 bits of b[], such that:

b[0]<0> = a[0]<0>
b[0]<1> = 0
b[0]<2> = 0
b[0]<3> = 0
b[0]<4> = a[1]<0>
b[0]<5> = 0
b[0]<6> = 0
b[0]<7> = 0
b[0]<8> = a[2]<0>
[…]
b[0]<28> = a[7]<0>
b[0]<29> = 0
b[0]<30> = 0
b[0]<31> = 0
b[1]<0> = a[8]<0>
[…]

Correct?
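If that interpretation is correct, the semantics could be pinned down with a scalar reference loop (plain C, host code; the index mapping is my reading of the table above, so treat it as a sketch, not the asker's confirmed intent):

```cpp
#include <stdint.h>

/* Reference semantics under the nibble-wise interpretation:
   bit 0 of input byte a[8*i + j] lands at bit 4*j of output
   word b[i]; all other output bits are zero. */
void pack_nibblewise(const uint8_t a[32], uint32_t b[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t w = 0;
        for (int j = 0; j < 8; ++j) {
            w |= (uint32_t)(a[8 * i + j] & 1u) << (4 * j);
        }
        b[i] = w;
    }
}
```

Note the `& 1u`, which makes the result independent of the upper seven bits of each input byte.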

What about the upper bits of the a[i]? Do we have a[i]<7:1> == 0b0000000, a[i]<7:1> == 0b1111111, or a[i]<7:1> == 0bxxxxxxx?

Are there any alignment guarantees for a? Does the input data have to be delivered as uint8_t a[32], or could it be delivered as uchar4 a[8], for example? The difference is in what alignment is guaranteed by CUDA for each type.
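To illustrate why the alignment question matters: if the 32 bytes may be read as eight aligned 32-bit words, each output word can be assembled from two input words with a few SWAR steps instead of a per-byte loop. A host-side sketch under the nibble-wise interpretation above, assuming little-endian byte order (which holds on all platforms CUDA supports); the initial mask with 0x01010101 also makes it indifferent to the upper bits a[i]<7:1>:

```cpp
#include <stdint.h>
#include <string.h>

/* Compress the LSBs of the four bytes in w (bits 0, 8, 16, 24)
   into four nibble-spaced bits (bits 0, 4, 8, 12). */
static uint32_t lsb4_to_nibbles(uint32_t w)
{
    uint32_t t = w & 0x01010101u;      /* keep only bit 0 of each byte */
    t = (t | (t >> 4)) & 0x00FF00FFu;  /* bits now at 0, 4, 16, 20 */
    t = (t | (t >> 8)) & 0x0000FFFFu;  /* bits now at 0, 4, 8, 12  */
    return t;
}

/* Word-at-a-time version of the nibble-wise packing. memcpy stands in
   for the word load here; on the device this would be an actual
   uint32_t or uchar4 load, which is where alignment guarantees matter. */
void pack_nibblewise_swar(const uint8_t a[32], uint32_t b[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t lo, hi;
        memcpy(&lo, a + 8 * i,     4);  /* a[8i+0 .. 8i+3] */
        memcpy(&hi, a + 8 * i + 4, 4);  /* a[8i+4 .. 8i+7] */
        b[i] = lsb4_to_nibbles(lo) | (lsb4_to_nibbles(hi) << 16);
    }
}
```

If the input really were delivered as uchar4 a[8], the two `memcpy` calls would become two vector loads per output word.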