__byte_perm with 16 bytes?

Hey experts,

I effectively need a __byte_perm instruction that operates on two 64-bit integers. I know there is no instruction for this, but what would be the best substitute? I don’t mind writing PTX for this to get maximum performance :)

Example use case:

__byte_perm(x,y,s) where x, y are 64 bit ints.

x = 0xffeeddccbbaa9988
y = 0x7766554433221100
s = 0xf64d

output: 0xff6644dd
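
To pin down the intended semantics, here is a minimal host-side sketch written as a plain byte loop. It assumes, as the example above implies, that selector indices 0 to 7 address the bytes of y and indices 8 to 15 address the bytes of x. The name ref_select_4_from_16 is made up for illustration, and the loop is meant as a reference to check against, not as a fast implementation.

```cpp
#include <cstdint>

// Hypothetical reference: pick 4 bytes out of the 16 bytes of hi:lo.
// Each selector nibble is a byte index 0..15; indices 0..7 address lo and
// indices 8..15 address hi (assumed from the example in the question).
static uint32_t ref_select_4_from_16(uint64_t hi, uint64_t lo, uint32_t sel)
{
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t idx  = (sel >> (4 * i)) & 0xf;   // nibble i of the selector
        uint64_t src  = (idx & 8) ? hi : lo;      // bit 3 picks the 64-bit word
        uint32_t byte = (uint32_t)(src >> (8 * (idx & 7))) & 0xff;
        result |= byte << (8 * i);                // place into output byte i
    }
    return result;
}
```

With the values from the example, ref_select_4_from_16(0xffeeddccbbaa9988, 0x7766554433221100, 0xf64d) gives 0xff6644dd.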

uint64 perm_64of64(uint64 x,y, uint32 s)
  return (uint64(perm_32of64(x,y,s>>16)) << 32) | perm_32of64(x,y,uint16(s))

uint32 perm_32of64(uint64 x,y, uint16 s)
  a = __byte_perm(x>>32, uint32(x), f1(s))
  b = __byte_perm(y>>32, uint32(y), f2(s))
  return __byte_perm(a, b, f3(s))

Alternatively:

uint32 perm_32of64(uint64 x,y, uint16 s)
  a = __byte_perm(x>>32, y>>32, g1(s))
  b = __byte_perm(uint32(x), uint32(y), g2(s))
  return __byte_perm(a, b, g3(s))

Functions f1…g3 are left as exercise to the reader :D

This is the best I could come up with on the double. Note that, the way you specified the function, the order of the less and more significant arguments is reversed relative to __byte_perm: it is __byte_perm (lo, hi, sel), but select_4_from_16 (hi, lo, sel).

__device__ unsigned int select_4_from_16 (unsigned long long int hi, 
                                          unsigned long long int lo, 
                                          int sel)
{
    unsigned int h, l, hh, hl, lh, ll, m;

    /* split input into four 32-bit registers */
    asm ("mov.b64  {%0,%1},%2;" : "=r"(ll), "=r"(lh) : "l"(lo));
    asm ("mov.b64  {%0,%1},%2;" : "=r"(hl), "=r"(hh) : "l"(hi));

    /* select appropriate bytes in high and low */
    l = __byte_perm (ll, lh, sel); 
    h = __byte_perm (hl, hh, sel);

    /* build mask for selecting high or low */
    m = __byte_perm (0xffffffff, 0, sel >> 1);

    /* select extracted bytes according to mask */
    return (h & ~m) | (l & m);
}
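
To see why the mask works, the same steps can be traced on the host. This is only a sketch: byte_perm_host is a hypothetical stand-in for __byte_perm, and it assumes the documented behaviour that only the low three bits of each selector nibble are used. select_4_from_16_host mirrors the device function, with plain shifts in place of the mov.b64 split.

```cpp
#include <cstdint>

// Host stand-in for __byte_perm (hypothetical helper, not a CUDA API):
// the low 3 bits of each selector nibble pick one of the 8 bytes of y:x.
static uint32_t byte_perm_host(uint32_t x, uint32_t y, uint32_t s)
{
    uint64_t src = ((uint64_t)y << 32) | x;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t idx = (s >> (4 * i)) & 7;
        r |= ((uint32_t)(src >> (8 * idx)) & 0xff) << (8 * i);
    }
    return r;
}

// Same steps as select_4_from_16, with the 64-bit split done by shifts.
static uint32_t select_4_from_16_host(uint64_t hi, uint64_t lo, uint32_t sel)
{
    uint32_t ll = (uint32_t)lo, lh = (uint32_t)(lo >> 32);
    uint32_t hl = (uint32_t)hi, hh = (uint32_t)(hi >> 32);
    uint32_t l = byte_perm_host(ll, lh, sel);   // candidate bytes from lo
    uint32_t h = byte_perm_host(hl, hh, sel);   // candidate bytes from hi
    // Bit 3 of each nibble lands in bit 2 after the shift, so nibbles that
    // address hi read the zero operand and give 0x00 in the mask; nibbles
    // that address lo read 0xffffffff and give 0xff.
    uint32_t m = byte_perm_host(0xffffffffu, 0, sel >> 1);
    return (h & ~m) | (l & m);
}
```

For sel = 0xf64d this gives l = 0x77664455, h = 0xffeeccdd and m = 0x00ffff00, so the result is 0xff6644dd, matching the example in the question.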

Awesome thanks! :)