I effectively need a __byte_perm instruction that operates on two 64-bit integers. I know there is no instruction for this, but what would be the best substitute? I don’t mind writing PTX for this to get maximum performance :)
Example use case:
__byte_perm(x,y,s) where x, y are 64-bit ints.
x = 0xffeeddccbbaa9988
y = 0x7766554433221100
s = 0xf64d
This is the best I could come up with on the double. Note that, the way you specified the function, the order of the less and more significant arguments is reversed relative to __byte_perm: it is __byte_perm (lo, hi, sel), but select_4_from_16 (hi, lo, sel).
__device__ unsigned int select_4_from_16 (unsigned long long int hi,
                                          unsigned long long int lo,
                                          int sel)
{
    unsigned int h, l, hh, hl, lh, ll, m;
    /* split input into four 32-bit registers */
    asm ("mov.b64 {%0,%1},%2;" : "=r"(ll), "=r"(lh) : "l"(lo));
    asm ("mov.b64 {%0,%1},%2;" : "=r"(hl), "=r"(hh) : "l"(hi));
    /* select appropriate bytes in high and low */
    l = __byte_perm (ll, lh, sel);
    h = __byte_perm (hl, hh, sel);
    /* build mask for selecting high or low */
    m = __byte_perm (0xffffffff, 0, sel >> 1);
    /* select extracted bytes according to mask */
    return (h & ~m) | (l & m);
}
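For checking the device function against known inputs, a slow host-side reference can help. The sketch below is only an illustration: the name select_4_from_16_ref is made up here, and it assumes the selector semantics implied by the example in the question, namely that nibble i of sel is a 4-bit index into the 16 bytes of hi:lo, with indices 0-7 addressing lo and 8-15 addressing hi.

```cpp
/* Hypothetical host-side reference, not part of the device code.
   Assumption: nibble i of sel (i = 0..3) is a 4-bit byte index into hi:lo;
   indices 0-7 address lo, indices 8-15 address hi. */
static unsigned int select_4_from_16_ref (unsigned long long int hi,
                                          unsigned long long int lo,
                                          unsigned int sel)
{
    unsigned int result = 0;
    for (int i = 0; i < 4; i++) {
        /* 4-bit index for output byte i */
        unsigned int idx = (sel >> (4 * i)) & 0xf;
        /* bit 3 of the index picks the 64-bit half */
        unsigned long long int src = (idx & 8) ? hi : lo;
        /* low 3 bits of the index pick the byte within that half */
        unsigned int byte = (unsigned int)(src >> (8 * (idx & 7))) & 0xff;
        /* place the selected byte into output byte i */
        result |= byte << (8 * i);
    }
    return result;
}
```

Under that assumption, the inputs from the question (x = 0xffeeddccbbaa9988 as hi, y = 0x7766554433221100 as lo, s = 0xf64d) give 0xff6644dd, which is also what select_4_from_16 produces if __byte_perm uses only the low three bits of each selector nibble, as its documentation describes.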