# __byte_perm with 16 bytes?

Hey experts,

I effectively need a __byte_perm instruction that operates on two 64-bit integers. I know there is no instruction for this, but what would be the best substitute? I don't mind writing PTX for this to get maximum performance :)

Example use case:

__byte_perm(x,y,s) where x, y are 64-bit ints and each hex digit of s selects one of the 16 input bytes (0-7 from y, 8-f from x).

x = 0xffeeddccbbaa9988
y = 0x7766554433221100
s = 0xf64d

output: 0xff6644dd
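
To pin down the intended semantics, here is a minimal host-side reference. It assumes each hex digit of `s` indexes one of 16 bytes, with `y` supplying bytes 0-7 and `x` supplying bytes 8-15; that numbering is inferred from the example above rather than stated outright. The name `byte_perm16_ref` is made up for illustration, and the byte-by-byte loop is meant for checking faster versions, not for use in a kernel.

```cpp
#include <cstdint>

using std::uint32_t;
using std::uint64_t;

// Reference for a 4-of-16 byte select (hypothetical helper name).
// Assumed numbering: y holds bytes 0-7, x holds bytes 8-15.
// Hex digit k of s chooses the source byte for result byte k.
static inline uint32_t byte_perm16_ref(uint64_t x, uint64_t y, uint32_t s)
{
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        uint32_t idx = (s >> (4 * k)) & 0xfu;       // selector digit for byte k
        uint64_t src = (idx & 8u) ? x : y;          // bit 3 picks x or y
        uint32_t byte = (uint32_t)(src >> (8 * (idx & 7u))) & 0xffu;
        r |= byte << (8 * k);                       // place into result byte k
    }
    return r;
}
```

With the example inputs, `byte_perm16_ref(0xffeeddccbbaa9988, 0x7766554433221100, 0xf64d)` evaluates to `0xff6644dd`.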

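```
uint64 perm_64of64(uint64 x, uint64 y, uint32 s)
    // widen before shifting, and parenthesize: + binds tighter than <<
    return (uint64(perm_32of64(x, y, s >> 16)) << 32) + perm_32of64(x, y, uint16(s))

uint32 perm_32of64(uint64 x, uint64 y, uint16 s)
    a = __byte_perm(x >> 32, uint32(x), f1(s))    // four candidate bytes from x
    b = __byte_perm(y >> 32, uint32(y), f2(s))    // four candidate bytes from y
    return __byte_perm(a, b, f3(s))               // choose between a and b per byte

// Alternatively:

uint32 perm_32of64(uint64 x, uint64 y, uint16 s)
    a = __byte_perm(x >> 32, y >> 32, g1(s))          // candidates from the high words
    b = __byte_perm(uint32(x), uint32(y), g2(s))      // candidates from the low words
    return __byte_perm(a, b, g3(s))                   // choose between a and b per byte
```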

Functions f1…g3 are left as an exercise to the reader :D
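
For what it's worth, one possible way to fill in f1, f2 and f3 for the first variant is sketched below. It assumes the byte numbering of the original example (y is bytes 0-7, x is bytes 8-15) and that `__byte_perm` only looks at the low three bits of each selector digit, as its documentation describes. `byte_perm_host` is a hypothetical host-side stand-in for `__byte_perm` so the sketch can be checked on a CPU; in device code you would call the intrinsic directly. The g1…g3 variant is not covered here.

```cpp
#include <cstdint>

using std::uint32_t;
using std::uint64_t;

// Hypothetical host stand-in for __byte_perm(lo, hi, sel):
// bytes 0-3 come from lo, bytes 4-7 from hi, 3 bits used per digit.
static inline uint32_t byte_perm_host(uint32_t lo, uint32_t hi, uint32_t sel)
{
    uint64_t bytes = ((uint64_t)hi << 32) | lo;
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        uint32_t idx = (sel >> (4 * k)) & 7u;
        r |= (uint32_t)((bytes >> (8 * idx)) & 0xffu) << (8 * k);
    }
    return r;
}

// The upper word is passed first, so bit 2 of every digit must be flipped
// to address the right half. Bit 3 is ignored by the permute.
static inline uint32_t f1(uint32_t s) { return s ^ 0x4444u; }
static inline uint32_t f2(uint32_t s) { return s ^ 0x4444u; }

// Result byte k comes from a (index k) if digit k has bit 3 set (x side),
// otherwise from b (index k + 4).
static inline uint32_t f3(uint32_t s) { return 0x3210u | ((~s >> 1) & 0x4444u); }

static inline uint32_t perm_32of64(uint64_t x, uint64_t y, uint32_t s)
{
    uint32_t a = byte_perm_host((uint32_t)(x >> 32), (uint32_t)x, f1(s));
    uint32_t b = byte_perm_host((uint32_t)(y >> 32), (uint32_t)y, f2(s));
    return byte_perm_host(a, b, f3(s));
}
```

This costs three permutes plus a few logic operations on the selector; if the selector is a compile-time constant, those selector transforms should fold away.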

This is the best I could come up with on short notice. Note that, the way you specified the function, the argument order for the less and more significant parts is reversed relative to __byte_perm: it is __byte_perm (lo, hi, sel), but select_4_from_16 (hi, lo, sel).

```
__device__ unsigned int select_4_from_16 (unsigned long long int hi,
                                          unsigned long long int lo,
                                          int sel)
{
    unsigned int h, l, hh, hl, lh, ll, m;

    /* split input into four 32-bit registers */
    asm ("mov.b64  {%0,%1},%2;" : "=r"(ll), "=r"(lh) : "l"(lo));
    asm ("mov.b64  {%0,%1},%2;" : "=r"(hl), "=r"(hh) : "l"(hi));

    /* select appropriate bytes in high and low */
    l = __byte_perm (ll, lh, sel);
    h = __byte_perm (hl, hh, sel);

    /* build mask for selecting high or low */
    m = __byte_perm (0xffffffff, 0, sel >> 1);

    /* select extracted bytes according to mask */
    return (h & ~m) | (l & m);
}
```
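
A sketch of how this 4-byte selection could be extended to the 8-of-16 case from the original question, by calling it once per 16-bit half of a 32-bit selector. `byte_perm`, `HOST_DEVICE` and `select_8_from_16` are names introduced here for illustration. The 64-bit split uses shifts instead of inline PTX so that the same code also compiles on the host, where `__byte_perm` is replaced by a plain loop that assumes only three bits per selector digit are used.

```cpp
#include <cstdint>

using std::uint32_t;
using std::uint64_t;

#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Uses the real intrinsic in device code, a loop emulation elsewhere.
HOST_DEVICE inline uint32_t byte_perm(uint32_t lo, uint32_t hi, uint32_t sel)
{
#if defined(__CUDA_ARCH__)
    return __byte_perm(lo, hi, sel);
#else
    uint64_t bytes = ((uint64_t)hi << 32) | lo;
    uint32_t r = 0;
    for (int k = 0; k < 4; ++k) {
        uint32_t idx = (sel >> (4 * k)) & 7u;
        r |= (uint32_t)((bytes >> (8 * idx)) & 0xffu) << (8 * k);
    }
    return r;
#endif
}

// Same logic as the function above, with shifts in place of mov.b64.
HOST_DEVICE inline uint32_t select_4_from_16(uint64_t hi, uint64_t lo, uint32_t sel)
{
    uint32_t l = byte_perm((uint32_t)lo, (uint32_t)(lo >> 32), sel);
    uint32_t h = byte_perm((uint32_t)hi, (uint32_t)(hi >> 32), sel);
    uint32_t m = byte_perm(0xffffffffu, 0u, sel >> 1);  // 0xff where digit < 8
    return (h & ~m) | (l & m);
}

// Hypothetical 8-of-16 wrapper: low 16 selector bits build the low word,
// high 16 selector bits build the high word.
HOST_DEVICE inline uint64_t select_8_from_16(uint64_t hi, uint64_t lo, uint32_t sel)
{
    uint32_t low  = select_4_from_16(hi, lo, sel & 0xffffu);
    uint32_t high = select_4_from_16(hi, lo, sel >> 16);
    return ((uint64_t)high << 32) | low;
}
```

The wrapper needs six permutes in total. The two mask permutes depend only on the selector, so they can be hoisted out of a loop when the same selector is applied to many inputs.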

Awesome thanks! :)