bit array on GPU

Hi, All,

I have an extremely large matrix that needs to be stored in global memory. The matrix consists only of 0s and 1s. To minimize memory usage, I am wondering whether it is possible to use just one bit per matrix element (like a bit array). And if it is possible, is it efficient for matrix calculations?

Any suggestions will be greatly appreciated!

Yes, you can pack the bits into an unsigned integer type and do the index math to set, toggle, or read an individual bit in a global array.
In my experience, though, you are probably better off storing each bool value in a uchar1 (8 bits, 1 byte). That has been the fastest option for me, assuming the bool matrix fits in global memory.
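
For reference, the index math for the packed variant could look roughly like the sketch below. The layout (row-major, 32 elements per 32-bit word, rows padded to whole words) and all helper names are my own assumptions for illustration, not something from the question. The BITMAT_HD macro is only there so the same helpers compile as plain C++ or inside a .cu file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lets the same helpers compile as plain C++ or as host/device code under nvcc.
#ifdef __CUDACC__
#define BITMAT_HD __host__ __device__
#else
#define BITMAT_HD
#endif

// Assumed layout: row-major, 32 elements per 32-bit word,
// each row padded up to a whole number of words.
BITMAT_HD inline std::size_t words_per_row(std::size_t cols) {
    return (cols + 31) / 32;
}

// Read element (r, c).
BITMAT_HD inline bool get_bit(const std::uint32_t* m, std::size_t cols,
                              std::size_t r, std::size_t c) {
    assert(c < cols);
    return (m[r * words_per_row(cols) + c / 32] >> (c % 32)) & 1u;
}

// Write element (r, c). Plain read-modify-write, so not safe if several
// threads update bits of the same word at the same time.
BITMAT_HD inline void set_bit(std::uint32_t* m, std::size_t cols,
                              std::size_t r, std::size_t c, bool v) {
    assert(c < cols);
    std::uint32_t& w = m[r * words_per_row(cols) + c / 32];
    const std::uint32_t mask = 1u << (c % 32);
    w = v ? (w | mask) : (w & ~mask);
}

// Flip element (r, c). Same concurrency caveat as set_bit.
BITMAT_HD inline void toggle_bit(std::uint32_t* m, std::size_t cols,
                                 std::size_t r, std::size_t c) {
    assert(c < cols);
    m[r * words_per_row(cols) + c / 32] ^= 1u << (c % 32);
}
```

One thing to watch with this layout: 32 neighbouring elements share a word, so if different threads may write bits of the same word concurrently, the update would have to go through something like atomicOr()/atomicAnd() instead of the plain read-modify-write above. That extra cost is one plausible reason the byte-per-element version can come out faster for write-heavy kernels.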

Using a bit array will be extremely efficient as you can perform a 32x32 matrix multiplication in registers in just 32 machine instructions, provided one of the matrices is available in transposed form.
The same goes for adding two untransposed matrices, provided overflow is guaranteed not to occur (as you say, elements are only ever 0 or 1).
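
To make the idea concrete, here is a minimal host-side sketch of why the transposed form helps: with one row per 32-bit word, every output element reduces to a single AND of two words. It assumes a boolean (AND/OR) product, which is only one possible reading of "matrix multiplication" here, and the struct and function names are made up for illustration. It is a plain serial reference loop, not the 32-instruction register version described above.

```cpp
#include <bitset>
#include <cstdint>

// One 32x32 bit matrix: word i holds row i, bit j of that word is column j.
struct Bits32x32 {
    std::uint32_t row[32];
};

// Number of set bits in a word. In a kernel, the CUDA intrinsic __popc()
// would play this role.
inline int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// Boolean product C = A * B, where bt holds B transposed
// (bt.row[j] is column j of B). C[i][j] is 1 when row i of A and
// column j of B have at least one set bit in common.
inline Bits32x32 multiply_bool(const Bits32x32& a, const Bits32x32& bt) {
    Bits32x32 c{};
    for (int i = 0; i < 32; ++i)
        for (int j = 0; j < 32; ++j)
            if (a.row[i] & bt.row[j]) c.row[i] |= 1u << j;
    return c;
}

// Integer dot product of row i of A with column j of B, if the counts
// are wanted rather than a 0/1 result.
inline int dot(const Bits32x32& a, const Bits32x32& bt, int i, int j) {
    return popcount32(a.row[i] & bt.row[j]);
}

// Element-wise sum of two untransposed matrices, valid only when no
// position is set in both inputs (so no element would exceed 1).
inline Bits32x32 add_disjoint(const Bits32x32& a, const Bits32x32& b) {
    Bits32x32 c{};
    for (int i = 0; i < 32; ++i) c.row[i] = a.row[i] | b.row[i];
    return c;
}
```

If the product is meant over GF(2) instead, the test on the AND result would become a parity check (popcount32(...) & 1), and the addition would be XOR rather than OR.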

Transposition itself is a bit more expensive though.

You can play further tricks using __ballot() (__ballot_sync() on CUDA 9 and later), although its application is limited, as the returned word is the same for all threads of the warp.