What is __byte_perm(x,y,s)?
How does it really work? (Please, with a small example.)
It’s a byte permute instruction - it picks four arbitrary bytes from two 32-bit values, based on indices that you provide.
See the CUDA 3.1 programming guide, p.131, and the PTX 2.1 documentation (prmt instruction).
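Each hex digit of s is an index into the eight bytes formed by the pair {y, x}: bytes 0-3 are the bytes of x (0 = least significant), bytes 4-7 are the bytes of y, and the lowest digit of s selects the byte that ends up in the lowest byte of the result. A small sketch (kernel name is just for illustration):
[codebox]#include <stdio.h>

// Minimal illustration of the selector semantics (sketch only).
__global__ void byte_perm_demo (void)
{
    unsigned int x = 0x33221100;   // supplies bytes 0..3 of the selection pool
    unsigned int y = 0x77665544;   // supplies bytes 4..7 of the selection pool

    printf ("%08x\n", __byte_perm (x, y, 0x3210)); // bytes 3,2,1,0 -> 33221100 (x unchanged)
    printf ("%08x\n", __byte_perm (x, y, 0x7654)); // bytes 7,6,5,4 -> 77665544 (y unchanged)
    printf ("%08x\n", __byte_perm (x, y, 0x5140)); // bytes 5,1,4,0 -> 55114400
}[/codebox]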
Thank you for your reply! But I still don't understand it. What would happen, for instance, in the following example:
x=0xff00;
y=0x00ff;
s=0, s=1, s=2…
Please help! Thanks.
Below is a mini CUDA app that demonstrates some of the applications of the __byte_perm() device function. The output of this program on my C2050 is:
[font="Courier New"]vector splat
input: 7766554433221100
output: 00000000
output: 11111111
output: 22222222
output: 33333333
output: 44444444
output: 55555555
output: 66666666
output: 77777777
byte field extraction
input: 7766554433221100
output: 33221100
output: 44332211
output: 55443322
output: 66554433
output: 77665544
byte rotation
input: 33221100
output: 33221100
output: 00332211
output: 11003322
output: 22110033
byte reversal
input: 33221100
output: 00112233
byte gather
input: 7766554433221100
output: 77553311
output: 66442200[/font]
[codebox]#include <stdio.h>
#include <stdlib.h>

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__global__ void byte_perm_examples (void)
{
    unsigned int lo;
    unsigned int hi;
    unsigned int t;

    lo = 0x33221100;
    hi = 0x77665544;

    printf ("\nvector splat\n");
    printf ("------------\n");
    printf ("input: %08x%08x\n", hi, lo);
    for (int i = 0; i < 8; i++) {
        /* replicate byte i of {hi,lo} into all four result bytes */
        t = __byte_perm (lo, hi, 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte field extraction\n");
    printf ("---------------------\n");
    printf ("input: %08x%08x\n", hi, lo);
    for (int i = 0; i < 5; i++) {
        /* extract the 32-bit field starting at byte i of {hi,lo} */
        t = __byte_perm (lo, hi, 0x3210 + 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte rotation\n");
    printf ("-------------\n");
    printf ("input: %08x\n", lo);
    for (int i = 0; i < 4; i++) {
        /* rotate right by i bytes */
        t = __byte_perm (lo, lo, 0x3210 + 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte reversal\n");
    printf ("-------------\n");
    printf ("input: %08x\n", lo);
    t = __byte_perm (lo, 0, 0x0123);   /* reverse byte order */
    printf ("output: %08x\n", t);

    printf ("\nbyte gather\n");
    printf ("-----------\n");
    printf ("input: %08x%08x\n", hi, lo);
    t = __byte_perm (lo, hi, 0x7531);  /* gather odd-numbered bytes */
    printf ("output: %08x\n", t);
    t = __byte_perm (lo, hi, 0x6420);  /* gather even-numbered bytes */
    printf ("output: %08x\n", t);
}

int main (void)
{
    byte_perm_examples<<<1,1>>>();
    CHECK_LAUNCH_ERROR();
    return EXIT_SUCCESS;
}[/codebox]
njuffa, thank you very much for your answer! This is quite interesting! Unfortunately, I am having some driver problems right now. But thank you!
Hi,
sorry for reviving a 4-year-old thread.
I was wondering if the __byte_perm() intrinsic maps to a hardware instruction on all compute architectures, and what the peak throughput of __byte_perm might be.
We were considering using this instruction to emulate 16-bit rotations of uint64_t states in the implementation of a cryptographic hash function.
Christian
While the __byte_perm() intrinsic is supported across all compute capabilities, it has hardware support only from Fermi on, i.e. compute capability 2.0. I do not know the exact throughput across Fermi, Kepler, and Maxwell, but I am reasonably sure that it is never less than the throughput of integer shifts. If your shift factors are compile-time constants, I would let the compiler worry about converting shifts to PRMT instructions as appropriate. Depending on what pipe(s) shifts and PRMT are assigned to, it might even make sense to use a mix of shifts and PRMTs.
Note that the __byte_perm() intrinsic is a simple abstraction of the PRMT instruction, designed to be as similar as possible to permutation intrinsics on other platforms. The actual PRMT instruction, which is accessible via inline PTX, has additional functionality (notably a sign-extension feature that I put to good use in the emulation of the SIMD-in-a-word functions) that is masked off in __byte_perm(). As a consequence, calling __byte_perm() with a non-constant selector value will emit a two-instruction sequence LOP.AND, PRMT. The LOP.AND is optimized out when the selector value is a compile-time constant.
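If the full PRMT functionality is needed (for example, a variable selector without the extra LOP.AND), it can be reached through inline PTX; a minimal sketch (the wrapper name is just an example):
[codebox]// Sketch only: expose the raw PRMT instruction via inline PTX, bypassing
// the selector masking that __byte_perm() applies.
__device__ __forceinline__ unsigned int prmt (unsigned int a, unsigned int b,
                                              unsigned int s)
{
    unsigned int r;
    asm ("prmt.b32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(s));
    return r;
}[/codebox]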
In my experience, PRMT has the same throughput as a single shift instruction.
PRMT is a good candidate for a bit rotation when the rotation amount is a compile-time constant multiple of 8, on Fermi and Kepler (sm_30). In these cases the compiler (CUDA 5.5) doesn't automatically merge the two shifts (a << n | a >> (32 - n)) into a single byte permute. In other words, the __byte_perm(x, 0, 0x1032) rotation will have double the throughput of (x << 16) + (x >> 16).
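As a sketch of the 64-bit case asked about above (the helper name and the hi/lo split are just for illustration), a rotate-left by 16 of a uint64_t can be expressed with two byte permutes:
[codebox]// Sketch only: rotate a 64-bit value left by 16 bits using two PRMTs.
// The value is split into two 32-bit halves hi:lo; selector digits 0..3
// pick bytes of the first operand, digits 4..7 bytes of the second.
__device__ __forceinline__ unsigned long long rotl64_16 (unsigned long long x)
{
    unsigned int lo  = (unsigned int)x;
    unsigned int hi  = (unsigned int)(x >> 32);
    unsigned int rlo = __byte_perm (lo, hi, 0x1076); // {lo1, lo0, hi3, hi2}
    unsigned int rhi = __byte_perm (lo, hi, 0x5432); // {hi1, hi0, lo3, lo2}
    return ((unsigned long long)rhi << 32) | rlo;
}[/codebox]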
Good point on byte-wise rotates vs. byte-wise shifts. What I observed in terms of optimization is the conversion of byte-wise shifts to PRMT, although I don't recall on which platform.
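For reference, a byte-granular logical shift can be expressed as a single permute with the second operand set to zero; a small sketch (function names just for illustration):
[codebox]// Sketch: byte-wise logical shifts as byte permutes. The second operand is
// zero, so selector digits 4..7 supply zero bytes.
__device__ unsigned int shl_byte (unsigned int x) { return __byte_perm (x, 0, 0x2104); } // x << 8
__device__ unsigned int shr_byte (unsigned int x) { return __byte_perm (x, 0, 0x4321); } // x >> 8[/codebox]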
The compiler implements idiom recognition for the common rotate idiom you show, and on sm_35 and later architectures the rotate idiom should translate into a single funnel shift (SHF) instruction, which likely means that using PRMT won’t lead to any further speedups. Last I checked the compiler was correctly applying that transformation.
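For completeness, the recognized idiom looks like the sketch below; on sm_35 and later it should compile to a single SHF, and where available the __funnelshift_l() intrinsic expresses the same rotate directly.
[codebox]// Sketch: the common rotate-left idiom the compiler recognizes.
__device__ unsigned int rotl32 (unsigned int x, unsigned int n)
{
    return (x << n) | (x >> (32 - n));   // assumes n in 1..31 here
    // roughly equivalent, sm_32 and later: __funnelshift_l (x, x, n);
}[/codebox]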
Based on my work on the emulation code for the SIMD-in-a-word instructions I would hazard a guess that the most efficient code sequences could well differ from architecture to architecture, as raw function unit throughput is just one aspect. Overall execution pipe utilization in code context is another aspect. I usually experiment with multiple alternative implementation styles to find the most appropriate one for each architecture, but I realize this takes time that may not always be available.