how to pack bits/extract MSBs efficiently?

Hi everyone. I want to take 4 bytes and extract the MSBs (31, 23, 15, 7) and pack the as 4 consecutive bits. I want to do this so I can take a 4x4 block of uint8 samples and generate a 16bit lookup table index, which reduces costly computation and takes advantage of the GPU’s || memories.

It seems the GPU doesn’t have good bit manipulation instructions like _mm_movemask_epi8(). The best way I can think of is this:

packed   =   ((x >> 7) & 0x3) |  ((x >> 21) & 0xc)

but I would like something better since I have 4, 4byte ints to process.

which assumes all bytes are either 0xff or 0.

I don’t realy care about the order of the packed bits, but it would be preferable to make the 12 border bits contiguous.


Thanks. Do you know why it works? Apparently, multiply shifts the higher bits less than the lower bits. I remember there was some optimization discovered in the 1990s where divide by an int constant can be replaced with multiply by an int constant (GCC implements it, but I don’t think it’s as simple as multiply and shift right).

#include <iostream>

unsigned a[16]={

















bool passes(unsigned int v)


  unsigned z[16];

for (int i=0; i<16; i++) {


	for (int j=0; j<i; j++) if (z[i]==z[j]) return false;


  return true;


int main()


  for (unsigned int i=0; i<0xFFFFFFFE; i++)

	if (passes(i)) {

	  printf("Found multiplier: %u\n\n", i);

	  for (int j=0; j<16; j++) 

	printf("Test vector %08X produces index %d\n", a[j], (i*a[j])>>28);




~$ g++ dm.cpp

~$ ./a.out

Found multiplier: 1060977

Test vector 00000000 produces index 0

Test vector 000000FF produces index 1

Test vector 0000FF00 produces index 2

Test vector 0000FFFF produces index 3

Test vector 00FF0000 produces index 4

Test vector 00FF00FF produces index 5

Test vector 00FFFF00 produces index 6

Test vector 00FFFFFF produces index 7

Test vector FF000000 produces index 8

Test vector FF0000FF produces index 9

Test vector FF00FF00 produces index 10

Test vector FF00FFFF produces index 11

Test vector FFFF0000 produces index 12

Test vector FFFF00FF produces index 13

Test vector FFFFFF00 produces index 14

Test vector FFFFFFFF produces index 15

Could this approach be used to compare say bits 0-3 to bits 4-7 of the same word?

Or compare bits 0-3 in two different words?

The original question involved 6 logical operations. Is it really faster on current architectures with the “slow” 32 bit multiply. How many logical operations would it take before the multiply/shift would be faster?