Are there any native CUDA primitives to efficiently convert a uint8 mask (only the least significant bit of each byte is used to indicate true or false) to a bitmask without using a loop?
input: a[32] of type uint8
output: b[4] of type uint32
Maybe I am particularly dense today, but it is not clear what the desired operation does. So the input comprises 32 bytes a[], each of which contains a boolean flag in a[i]<0>, i=0, …, 31. And these 32 bits are to be deposited nibble-wise in the 128 bits of b[], such that:
What about the upper bits of each a[i]? Do we have a[i]<7:1> == 0b0000000, a[i]<7:1> == 0b1111111, or a[i]<7:1> == 0bxxxxxxx?
Are there any alignment guarantees for a? Does the input data have to be delivered as uint8_t a[32], or could it be delivered as uchar4 a[8], for example? The difference is in what alignment is guaranteed by CUDA for each type.
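To make the open questions concrete, here is a minimal sketch under one possible reading of the problem. It is not necessarily the b[4] layout the question asks for. The assumptions: the upper bits a[i]<7:1> are unknown and must be masked off, a has no alignment guarantee beyond one byte, the platform is little-endian (as CUDA GPUs are), and the 32 flags go into a single 32-bit word with bit i holding a[i]<0>. The names `pack8`, `pack32`, and `HOST_DEVICE` are hypothetical helpers, not CUDA primitives. It uses the well-known multiply trick to gather eight byte LSBs at once.

```cpp
#include <stdint.h>
#include <string.h>

// Compiles as plain C++ or as CUDA host/device code.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: gather the LSBs of 8 consecutive bytes into bits 0..7.
// Assumes little-endian byte order, so p[i] lands in bits 8*i .. 8*i+7 of x.
HOST_DEVICE inline uint32_t pack8(const uint8_t *p)
{
    uint64_t x;
    memcpy(&x, p, sizeof x);        // byte-wise copy: no alignment assumption on p
    x &= 0x0101010101010101ULL;     // keep only bit 0 of each byte (a[i]<7:1> unknown)
    // The multiplier moves the LSB of byte i to bit 56 + i without carries.
    return (uint32_t)((x * 0x0102040810204080ULL) >> 56);
}

// Hypothetical helper: bit i of the result equals a[i]<0>, for i = 0..31.
HOST_DEVICE inline uint32_t pack32(const uint8_t *a)
{
    return pack8(a) | (pack8(a + 8) << 8) | (pack8(a + 16) << 16) | (pack8(a + 24) << 24);
}
```

If a were known to be suitably aligned (for instance, delivered as uchar4 a[8]), the memcpy could be replaced by wider loads. If instead each of the 32 lanes of a warp holds one byte, `__ballot_sync(0xffffffff, a[lane] & 1)` would produce the same 32-bit word directly.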