Are there any native CUDA primitives to efficiently cast a uint8 mask (only the least significant bit is used to indicate true or false) to a bitmask, without doing a loop?

input: a[32] of type uint8
output: b[4] of type uint32

I checked CUDA Math API :: CUDA Toolkit Documentation but didn't find any.

Maybe I am particularly dense today, but it is not clear what the desired operation does. So the input comprises 32 bytes `a[]`, each of which contains a boolean flag in `a[i]<0>`, i=0, …, 31. And these 32 bits are to be deposited nibble-wise in the 128 bits of `b[]`, such that:

`b[0]<0> = a[0]<0>`
`b[0]<1> = 0`
`b[0]<2> = 0`
`b[0]<3> = 0`
`b[0]<4> = a[1]<0>`
`b[0]<5> = 0`
`b[0]<6> = 0`
`b[0]<7> = 0`
`b[0]<8> = a[2]<0>`
[…]
`b[0]<28> = a[7]<0>`
`b[0]<29> = 0`
`b[0]<30> = 0`
`b[0]<31> = 0`
`b[1]<0> = a[8]<0>`
[…]

Correct?
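If that reading is right, a plain reference loop (ordinary host C; the function name is my own) pins down the mapping exactly, so any loop-free version could be validated against it:

```c
#include <stdint.h>
#include <assert.h>

/* Reference implementation of the presumed mapping: the LSB of input
   byte a[i] lands in bit 4*(i%8) of output word b[i/8]; all other
   output bits are zero. */
void pack_lsb_nibblewise(const uint8_t a[32], uint32_t b[4])
{
    for (int w = 0; w < 4; w++) {
        uint32_t r = 0;
        for (int i = 0; i < 8; i++) {
            r |= (uint32_t)(a[8 * w + i] & 1) << (4 * i);
        }
        b[w] = r;
    }
}
```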

What about the upper bits of the `a[i]`? Do we have `a[i]<7:1> == 0b0000000`, `a[i]<7:1> == 0b1111111`, or `a[i]<7:1> == 0bxxxxxxx`?

Are there any alignment guarantees for `a`? Does the input data have to be delivered as `uint8_t a[32]`, or could it be delivered as `uchar4 a[8]`, for example? The difference is in what alignment CUDA guarantees for each type.
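The alignment question matters because, with 8-byte alignment, each group of eight flag bytes can be read as one 64-bit word and packed with three shift-and-mask steps instead of a per-bit loop. A host-side C sketch of that idea (function name, little-endian byte order, and the nibble-wise interpretation above are all my assumptions; the initial AND makes it safe even if `a[i]<7:1>` is arbitrary garbage):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Loop-free (per word) variant: load 8 bytes as a uint64_t, where the
   flags sit at bit positions 0, 8, ..., 56, then halve the bit spacing
   from 8 to 4 in three shift/mask steps. One 32-bit output word results
   per 8 input bytes. */
void pack_lsb_nibblewise_u64(const uint8_t a[32], uint32_t b[4])
{
    for (int w = 0; w < 4; w++) {
        uint64_t v;
        memcpy(&v, a + 8 * w, 8);            /* little-endian load assumed */
        v &= 0x0101010101010101ull;          /* keep only the LSB flags    */
        v = (v | (v >>  4)) & 0x00FF00FF00FF00FFull; /* flags -> 0,4,16,20,... */
        v = (v | (v >>  8)) & 0x0000FFFF0000FFFFull; /* flags -> 0,4,8,12,...  */
        v = (v | (v >> 16));                         /* flags -> 0,4,...,28    */
        b[w] = (uint32_t)v;
    }
}
```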