how to pack bits/extract MSBs efficiently?

Uncle_Joe · February 2, 2010, 4:37am

Hi everyone. I want to take 4 bytes and extract the MSBs (31, 23, 15, 7) and pack the as 4 consecutive bits. I want to do this so I can take a 4x4 block of uint8 samples and generate a 16bit lookup table index, which reduces costly computation and takes advantage of the GPU’s || memories.

It seems the GPU doesn’t have good bit manipulation instructions like _mm_movemask_epi8(). The best way I can think of is this:

packed   =   ((x >> 7) & 0x3) |  ((x >> 21) & 0xc)

but I would like something better since I have 4, 4byte ints to process.

which assumes all bytes are either 0xff or 0.

I don’t realy care about the order of the packed bits, but it would be preferable to make the 12 border bits contiguous.

SPWorley · February 2, 2010, 6:22am

(v*1060977)>>28

Uncle_Joe · February 2, 2010, 5:04pm

Thanks. Do you know why it works? Apparently, multiply shifts the higher bits less than the lower bits. I remember there was some optimization discovered in the 1990s where divide by an int constant can be replaced with multiply by an int constant (GCC implements it, but I don’t think it’s as simple as multiply and shift right).

SPWorley · February 2, 2010, 5:33pm

#include <iostream>

unsigned a[16]={

  0x00000000,

  0x000000FF,

  0x0000FF00,

  0x0000FFFF,

  0x00FF0000,

  0x00FF00FF,

  0x00FFFF00,

  0x00FFFFFF,

  0xFF000000,

  0xFF0000FF,

  0xFF00FF00,

  0xFF00FFFF,

  0xFFFF0000,

  0xFFFF00FF,

  0xFFFFFF00,

  0xFFFFFFFF};

bool passes(unsigned int v)

{

  unsigned z[16];

for (int i=0; i<16; i++) {

	z[i]=(a[i]*v)>>28;

	for (int j=0; j<i; j++) if (z[i]==z[j]) return false;

  }

  return true;

}

int main()

{

  for (unsigned int i=0; i<0xFFFFFFFE; i++)

	if (passes(i)) {

	  printf("Found multiplier: %u\n\n", i);

	  for (int j=0; j<16; j++) 

	printf("Test vector %08X produces index %d\n", a[j], (i*a[j])>>28);

	  break;

	}

}

~$ g++ dm.cpp

~$ ./a.out

Found multiplier: 1060977

Test vector 00000000 produces index 0

Test vector 000000FF produces index 1

Test vector 0000FF00 produces index 2

Test vector 0000FFFF produces index 3

Test vector 00FF0000 produces index 4

Test vector 00FF00FF produces index 5

Test vector 00FFFF00 produces index 6

Test vector 00FFFFFF produces index 7

Test vector FF000000 produces index 8

Test vector FF0000FF produces index 9

Test vector FF00FF00 produces index 10

Test vector FF00FFFF produces index 11

Test vector FFFF0000 produces index 12

Test vector FFFF00FF produces index 13

Test vector FFFFFF00 produces index 14

Test vector FFFFFFFF produces index 15

spurious_interrupt · February 2, 2010, 6:02pm

Could this approach be used to compare say bits 0-3 to bits 4-7 of the same word?

Or compare bits 0-3 in two different words?

The original question involved 6 logical operations. Is it really faster on current architectures with the “slow” 32 bit multiply. How many logical operations would it take before the multiply/shift would be faster?

Topic		Replies	Views
fastest way to access the 8 MSB of an (unsigned) int? CUDA Programming and Performance	2	1091	December 30, 2009
Possible to use the CUDA math API integer intrinsics to find the nth unset bit in a 32 bit int CUDA Programming and Performance	37	8363	March 1, 2015
How much speed of 64bit integer algebra in the latest GPUs? CUDA Programming and Performance	2	2050	April 21, 2014
why shift is slower than integer multiply shift ,integer multiply CUDA Programming and Performance	20	5622	July 1, 2010
Compiler bug? CUDA Programming and Performance	2	1165	July 24, 2009
PTX,... does comparing a bit either a 0 or 1 take 64 bits? CUDA Programming and Performance	3	491	April 13, 2018
Bit level processing CUDA Programming and Performance	5	5113	December 26, 2009
Bytes manipulation in PTX CUDA Programming and Performance hw , cuda , kernel	23	2799	March 2, 2023
Multiplication gives 24bit results. How to get 32? CUDA Programming and Performance	11	3477	March 1, 2009
Parallelizing multiplication CUDA Programming and Performance	14	7507	October 11, 2008

how to pack bits/extract MSBs efficiently?

Related topics