Questions about __byte_perm(x,y,s)

What is __byte_perm(x,y,s)?

How does it really work?(Please with a small example?)

It’s a byte permute instruction - it picks four arbitrary bytes from two 32-bit values, based on indices that you provide.

See the CUDA 3.1 programming guide, p.131, and the PTX 2.1 documentation (prmt instruction).

Thank you your replay! But I still don’t understand it? What would for instance happen in the following example:



s=0, s=1, s=2…

Please help! Thanks.

Below is a mini CUDA app that demonstrates some of the applications of the __byte_perm() device function. The output of this program on my C2050 is:

[font=“Courier New”]vector splat

input: 7766554433221100

output: 00000000

output: 11111111

output: 22222222

output: 33333333

output: 44444444

output: 55555555

output: 66666666

output: 77777777

byte field extraction

input: 7766554433221100

output: 33221100

output: 44332211

output: 55443322

output: 66554433

output: 77665544

byte rotation

input: 33221100

output: 33221100

output: 00332211

output: 11003322

output: 22110033

byte reversal

input: 33221100

output: 00112233

byte gather

input: 7766554433221100

output: 77553311

output: 66442200[/font]

[codebox]#include <stdio.h>

#include <stdlib.h>

// Macro to catch CUDA errors in kernel launches


do { \

/* Check synchronous errors, i.e. pre-launch */                   \

cudaError_t err = cudaGetLastError();                             \

if (cudaSuccess != err) {                                         \

    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

             __FILE__, __LINE__, cudaGetErrorString(err) );       \

    exit(EXIT_FAILURE);                                           \

}                                                                 \

/* Check asynchronous errors, i.e. kernel failed (ULF) */         \

err = cudaThreadSynchronize();                                    \

if (cudaSuccess != err) {                                         \

    fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

             __FILE__, __LINE__, cudaGetErrorString( err) );      \

    exit(EXIT_FAILURE);                                           \

}                                                                 \

} while (0)

global void byte_perm_examples (void)


unsigned int lo;

unsigned int hi;

unsigned int t;

lo = 0x33221100;

hi = 0x77665544;

printf ("\nvector splat\n");

printf ("------------\n");

printf ("input:  %08x%08x\n", hi, lo);

for (int i = 0; i < 8; i++) {

    t = __byte_perm (lo, hi, 0x1111 * i);

    printf ("output: %08x\n", t);


printf ("\nbyte field extraction\n");

printf ("---------------------\n");

printf ("input:  %08x%08x\n", hi, lo);

for (int i = 0; i < 5; i++) {

    t = __byte_perm (lo, hi, 0x3210 + 0x1111 * i);

    printf ("output: %08x\n", t);


printf ("\nbyte rotation\n");

printf ("-------------\n");

printf ("input:  %08x\n", lo);

for (int i = 0; i < 4; i++) {

    t = __byte_perm (lo, lo, 0x3210 + 0x1111 * i);

    printf ("output: %08x\n", t);


printf ("\nbyte reversal\n");

printf ("-------------\n");

printf ("input:  %08x\n", lo);

t = __byte_perm (lo, 0, 0x0123);

printf ("output: %08x\n", t);

printf ("\nbyte gather\n");

printf ("-----------\n");    

printf ("input:  %08x%08x\n", hi, lo);

t = __byte_perm (lo, hi, 0x7531);

printf ("output: %08x\n", t);

t = __byte_perm (lo, hi, 0x6420);

printf ("output: %08x\n", t);


int main (void)






njuffa, thank you very much for your answer! This is quite interesting! Unfortunately, I am having some Driver-problems right now. But thank you!


sorry for reviving a 4 year old thread.

I was wondering if the __byte_perm intrinsics map to hardware instructions on all Compute architectures and what the peak throughput of __byte_perm might be…

We were considering tp use this instruction for emulating 16 bit rotations of uint64_t states in the implementation of a cryptographic hash function.


While the __byte_perm() intrinsic is supported across all compute capabilities, it has hardware support only from Fermi on, i.e. compute capability 2.0. I do not know the exact throughput across Fermi, Kepler, and Maxwell but am reasonably sure that it is never less than then the throughput of integer shifts. If your shift factors are compile-time constants, I would let the compiler worry about converting shifts to PRMT instructions as appropriate. Depending on what pipe(s) shifts and PRMT are assigned to, it might even make sense to use a mix of shifts and PRMTs.

Note that the __byte_perm() intrinsic is a simple abstraction of the PRMT instruction designed to be as similar as possible to permutation intrinsics on other platforms. The actual PRMT instruction, which is accessible via inline PTX, has additional functionally (notably a sign extension feature that I put to good use in the emulation of the SIMD-in-a-word functions) that is masked off in __byte_perm(). As a consequence, calling __byte_perm() with a non-constant selector value will emit a two-instruction sequence LOP.AND, PRMT. The LOP.AND is optimized out when the selector value is a compile-time constant.

For my experience, PRMT has the same throughput of a single shift instruction.
PRMT is a good candidate for a bit rotation when the selector values is a compile-time constant multiple of 8 on Fermi and Kepler 3.0. In these cases,
the compiler (CUDA 5.5) doesn’t automatically merge the two shifts (a << n | a >> (32 - n)) into a single byte permute.

In other words, the __byte_perm(x,0,0x1032) rotation,
will have double the throughput of (x << 16) + (x >> 16).

Good point on byte-wise rotates vs byte-wise shifts. What I observed in terms of optimization is the conversion of byte-wise shift to PRMT, although I don’t recall on which platform.

The compiler implements idiom recognition for the common rotate idiom you show, and on sm_35 and later architectures the rotate idiom should translate into a single funnel shift (SHF) instruction, which likely means that using PRMT won’t lead to any further speedups. Last I checked the compiler was correctly applying that transformation.

Based on my work on the emulation code for the SIMD-in-a-word instructions I would hazard a guess that the most efficient code sequences could well differ from architecture to architecture, as raw function unit throughput is just one aspect. Overall execution pipe utilization in code context is another aspect. I usually experiment with multiple alternative implementation styles to find the most appropriate one for each architecture, but I realize this takes time that may not always be available.