Bit-level processing

I have binary data. Is there an economical way I can do byte stuffing and store 8 bytes into one byte? Is bit-level processing going to improve my speed?

Sandarbh, would you please elaborate on the above (storing 8 bytes in one byte)?

P.S.: I have sent you a PM; please check your messages.

See, I have binary data. Instead of storing eight bytes, I can store the eight data elements in one byte and process them in one go. Instead of operating on the eight bytes individually, I can do it in one step, which should take about one eighth of the time. The main question is how to pack the 8 data elements into one byte quickly. Hope you get it.

Basically, he’s saying that he has binary data (literally, 1-bit data points) and wants to pack 8 data bits into a single byte. The easiest way I know to do this would be to just make a couple of little macros that use bitmasks to set or get the bit you want from the byte. However, if you’re going through that trouble (and if it makes sense for your problem), I’d at least use a 32-bit integer, if not a uint4 vector (128 bits), so that you get the best memory access performance per thread.
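
To make the bitmask idea concrete, here is a minimal sketch in C of what such macros might look like for a 32-bit word. The names GET_BIT, SET_BIT, CLEAR_BIT and write_bit are made up for illustration, and the layout (bit i of the word holds data point i) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper macros: bit i of a 32-bit word holds data point i. */
#define GET_BIT(word, i)   (((word) >> (i)) & 1u)
#define SET_BIT(word, i)   ((word) |  (UINT32_C(1) << (i)))
#define CLEAR_BIT(word, i) ((word) & ~(UINT32_C(1) << (i)))

/* Return `word` with bit i set to `value` (zero clears it, nonzero sets it). */
static uint32_t write_bit(uint32_t word, unsigned i, unsigned value)
{
    assert(i < 32);  /* shifting a 32-bit value by 32 or more is undefined */
    return value ? SET_BIT(word, i) : CLEAR_BIT(word, i);
}
```

For example, write_bit(0, 3, 1) yields 8, and GET_BIT(8u, 3) then reads back 1.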

And yes, sandarbh, it’ll improve your speed if you’re operating on a lot of these data points, since you’ll be able to process more data for each memory access that you have to make.
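
As a rough illustration of where the gain comes from, the sketch below counts the positions at which two packed arrays both hold a 1. This particular task and the function names are assumptions picked for the example, not something from the original question. The point is that a single AND on a 32-bit word handles 32 data points per memory access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits by repeatedly clearing the lowest one. */
static unsigned count_ones(uint32_t w)
{
    unsigned n = 0;
    while (w) {
        w &= w - 1;  /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical example: number of positions where both inputs hold a 1. */
static unsigned count_common_ones(const uint32_t *a, const uint32_t *b, size_t words)
{
    unsigned total = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < words; i++)
        total += count_ones(a[i] & b[i]);  /* one AND handles 32 data points */
    return total;
}
```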


I get the part about masking bits to work with them. But initially I have a single binary value stored in each byte. How can I store multiple values into one byte in parallel?

Thank You for your help.

You probably want to pack them in a preprocessing step. Assuming that your initial values are either 1 or 0, then you can make a byte out of 8 bits like so:

uchar compressedByte = data0|(data1<<1)|(data2<<2)|(data3<<3)|(data4<<4)|(data5<<5)|(data6<<6)|(data7<<7);

You could also replace the OR ‘|’ with addition ‘+’ and have the same effect, since the shifted bits never overlap.

Another interesting way to do it is to have your data values be 0 and -1, which are 0x00 and 0xFF in hexadecimal. Then you can construct the byte like this:

uchar compressedByte = (data0&0x01)|(data1&0x02)|(data2&0x04)|(data3&0x08)|(data4&0x10)|(data5&0x20)|(data6&0x40)|(data7&0x80);
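
Here is a small C sketch of both one-liners written as loops, so they can be checked against each other. The function names are hypothetical, and uint8_t stands in for uchar.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: inputs are 0 or 1; shift each into its bit position. */
static uint8_t pack_from_01(const uint8_t d[8])
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        assert(d[i] <= 1);
        out |= (uint8_t)(d[i] << i);
    }
    return out;
}

/* Variant 2: inputs are 0x00 or 0xFF; keep one bit from each with a mask. */
static uint8_t pack_from_00FF(const uint8_t d[8])
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        assert(d[i] == 0x00 || d[i] == 0xFF);
        out |= (uint8_t)(d[i] & (1u << i));
    }
    return out;
}
```

With the inputs {1,0,1,1,0,0,0,1} and {0xFF,0,0xFF,0xFF,0,0,0,0xFF} respectively, both functions return 0x8D. Which variant is faster depends on how the source values are produced in the first place.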

If you have your initial data in an array, it might be easiest to do something like this:

uchar sourceBits[arrayLength];	//NOTE:  pad this array so that the length is divisible by 32.
uint bitsPackedIntoUints[arrayLength>>5];

for (int n = 0; n < (arrayLength>>5); n++)
{
	bitsPackedIntoUints[n] = 0;
	for (int m = 0; m < 32; m++)
	{
		// Cast to uint before shifting so that m == 31 does not overflow a signed int.
		bitsPackedIntoUints[n] |= (uint)sourceBits[(n<<5)+m]<<m;
	}
}

Unless you’re generating the initial values on the GPU, it’s best to do this preprocessing on the CPU side. The reason for this is that the CPU can finish a simple operation like this in less time than it would take to send the uncompressed data over the PCIe bus, which is the slowest component in the system other than the hard drive. Remember, the compressed data is 8x smaller!

If the data was generated on the GPU in the first place, then obviously the GPU should compress it. In this case, just replace the outer for loop with the thread ID, and launch a grid with arrayLength>>5 threads, so that each thread builds one packed uint. You can’t simply divide the inner loop between threads, because |= is a read-modify-write operation: if several threads update the same value at once, only one write survives and the others are lost.
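
A rough sketch of that one-thread-per-word layout, written as plain C so it can be run on the CPU. The function names are made up. In an actual CUDA kernel the index tid would typically come from blockIdx.x * blockDim.x + threadIdx.x, and the loop in pack_all would be replaced by the grid launch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Work for one hypothetical thread: pack 32 source values into word `tid`. */
static void pack_one_word(const uint8_t *sourceBits, uint32_t *packed, size_t tid)
{
    uint32_t word = 0;  /* accumulate locally, then write once */
    for (unsigned m = 0; m < 32; m++)
        word |= (uint32_t)(sourceBits[tid * 32 + m] & 1u) << m;
    packed[tid] = word;  /* only this thread ever touches packed[tid] */
}

/* CPU stand-in for the grid launch: one call per output word. */
static void pack_all(const uint8_t *sourceBits, uint32_t *packed, size_t arrayLength)
{
    assert(arrayLength % 32 == 0);  /* source assumed padded to a multiple of 32 */
    for (size_t tid = 0; tid < arrayLength / 32; tid++)
        pack_one_word(sourceBits, packed, tid);
}
```

Because each output word is written by exactly one thread, no two threads ever do a read-modify-write on the same location.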