What is __byte_perm(x,y,s)?
How does it really work? (Please, with a small example.)
It’s a byte permute instruction - it picks four arbitrary bytes from two 32-bit values, based on indices that you provide.
See the CUDA 3.1 programming guide, p.131, and the PTX 2.1 documentation (prmt instruction).
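Each hex digit of s is an index into the eight bytes formed by the pair {y, x}: bytes 0-3 are the bytes of x (0 = least significant), bytes 4-7 are the bytes of y, and the lowest digit of s selects the byte that ends up in the lowest byte of the result. A small sketch (kernel name is just for illustration):
[codebox]#include <stdio.h>

// Minimal illustration of the selector semantics (sketch only).
__global__ void byte_perm_demo (void)
{
    unsigned int x = 0x33221100;   // supplies bytes 0..3 of the selection pool
    unsigned int y = 0x77665544;   // supplies bytes 4..7 of the selection pool

    printf ("%08x\n", __byte_perm (x, y, 0x3210)); // bytes 3,2,1,0 -> 33221100 (x unchanged)
    printf ("%08x\n", __byte_perm (x, y, 0x7654)); // bytes 7,6,5,4 -> 77665544 (y unchanged)
    printf ("%08x\n", __byte_perm (x, y, 0x5140)); // bytes 5,1,4,0 -> 55114400
}[/codebox]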
Thank you for your reply! But I still don't understand it. What would happen, for instance, in the following example:
x=0xff00;
y=0x00ff;
s=0, s=1, s=2…
Please help! Thanks.
Below is a mini CUDA app that demonstrates some of the applications of the __byte_perm() device function. The output of this program on my C2050 is:
[font="Courier New"]vector splat
input: 7766554433221100
output: 00000000
output: 11111111
output: 22222222
output: 33333333
output: 44444444
output: 55555555
output: 66666666
output: 77777777
byte field extraction
input: 7766554433221100
output: 33221100
output: 44332211
output: 55443322
output: 66554433
output: 77665544
byte rotation
input: 33221100
output: 33221100
output: 00332211
output: 11003322
output: 22110033
byte reversal
input: 33221100
output: 00112233
byte gather
input: 7766554433221100
output: 77553311
output: 66442200[/font]
[codebox]#include <stdio.h>
#include <stdlib.h>

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__global__ void byte_perm_examples (void)
{
    unsigned int lo;
    unsigned int hi;
    unsigned int t;

    lo = 0x33221100;
    hi = 0x77665544;

    printf ("\nvector splat\n");
    printf ("------------\n");
    printf ("input: %08x%08x\n", hi, lo);
    for (int i = 0; i < 8; i++) {
        /* replicate byte i of {hi,lo} into all four result bytes */
        t = __byte_perm (lo, hi, 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte field extraction\n");
    printf ("---------------------\n");
    printf ("input: %08x%08x\n", hi, lo);
    for (int i = 0; i < 5; i++) {
        /* extract the 32-bit field starting at byte i of {hi,lo} */
        t = __byte_perm (lo, hi, 0x3210 + 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte rotation\n");
    printf ("-------------\n");
    printf ("input: %08x\n", lo);
    for (int i = 0; i < 4; i++) {
        /* rotate right by i bytes */
        t = __byte_perm (lo, lo, 0x3210 + 0x1111 * i);
        printf ("output: %08x\n", t);
    }

    printf ("\nbyte reversal\n");
    printf ("-------------\n");
    printf ("input: %08x\n", lo);
    t = __byte_perm (lo, 0, 0x0123);   /* reverse byte order */
    printf ("output: %08x\n", t);

    printf ("\nbyte gather\n");
    printf ("-----------\n");
    printf ("input: %08x%08x\n", hi, lo);
    t = __byte_perm (lo, hi, 0x7531);  /* gather odd-numbered bytes */
    printf ("output: %08x\n", t);
    t = __byte_perm (lo, hi, 0x6420);  /* gather even-numbered bytes */
    printf ("output: %08x\n", t);
}

int main (void)
{
    byte_perm_examples<<<1,1>>>();
    CHECK_LAUNCH_ERROR();
    return EXIT_SUCCESS;
}[/codebox]
njuffa, thank you very much for your answer! This is quite interesting! Unfortunately, I am having some driver problems right now. But thank you!
Hi,
sorry for reviving a 4-year-old thread.
I was wondering if the __byte_perm() intrinsic maps to a hardware instruction on all compute architectures, and what the peak throughput of __byte_perm might be.
We were considering using this instruction to emulate 16-bit rotations of uint64_t states in the implementation of a cryptographic hash function.
Christian
While the __byte_perm() intrinsic is supported across all compute capabilities, it has hardware support only from Fermi on, i.e. compute capability 2.0. I do not know the exact throughput across Fermi, Kepler, and Maxwell, but I am reasonably sure that it is never less than the throughput of integer shifts. If your shift factors are compile-time constants, I would let the compiler worry about converting shifts to PRMT instructions as appropriate. Depending on what pipe(s) shifts and PRMT are assigned to, it might even make sense to use a mix of shifts and PRMTs.
Note that the __byte_perm() intrinsic is a simple abstraction of the PRMT instruction, designed to be as similar as possible to permutation intrinsics on other platforms. The actual PRMT instruction, which is accessible via inline PTX, has additional functionality (notably a sign-extension feature that I put to good use in the emulation of the SIMD-in-a-word functions) that is masked off in __byte_perm(). As a consequence, calling __byte_perm() with a non-constant selector value will emit a two-instruction sequence LOP.AND, PRMT. The LOP.AND is optimized out when the selector value is a compile-time constant.
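If the full PRMT functionality is needed (for example, a variable selector without the extra LOP.AND), it can be reached through inline PTX; a minimal sketch (the wrapper name is just an example):
[codebox]// Sketch only: expose the raw PRMT instruction via inline PTX, bypassing
// the selector masking that __byte_perm() applies.
__device__ __forceinline__ unsigned int prmt (unsigned int a, unsigned int b,
                                              unsigned int s)
{
    unsigned int r;
    asm ("prmt.b32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(s));
    return r;
}[/codebox]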
In my experience, PRMT has the same throughput as a single shift instruction.
PRMT is a good candidate for a bit rotation when the rotation amount is a compile-time constant multiple of 8, on Fermi and Kepler (sm_30). In these cases the compiler (CUDA 5.5) doesn't automatically merge the two shifts (a << n | a >> (32 - n)) into a single byte permute. In other words, the __byte_perm(x, 0, 0x1032) rotation will have double the throughput of (x << 16) + (x >> 16).
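As a sketch of the 64-bit case asked about above (the helper name and the hi/lo split are just for illustration), a rotate-left by 16 of a uint64_t can be expressed with two byte permutes:
[codebox]// Sketch only: rotate a 64-bit value left by 16 bits using two PRMTs.
// The value is split into two 32-bit halves hi:lo; selector digits 0..3
// pick bytes of the first operand, digits 4..7 bytes of the second.
__device__ __forceinline__ unsigned long long rotl64_16 (unsigned long long x)
{
    unsigned int lo  = (unsigned int)x;
    unsigned int hi  = (unsigned int)(x >> 32);
    unsigned int rlo = __byte_perm (lo, hi, 0x1076); // {lo1, lo0, hi3, hi2}
    unsigned int rhi = __byte_perm (lo, hi, 0x5432); // {hi1, hi0, lo3, lo2}
    return ((unsigned long long)rhi << 32) | rlo;
}[/codebox]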
Good point on byte-wise rotates vs. byte-wise shifts. What I observed in terms of optimization is the conversion of byte-wise shifts to PRMT, although I don't recall on which platform.
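For reference, a byte-granular logical shift can be expressed as a single permute with the second operand set to zero; a small sketch (function names just for illustration):
[codebox]// Sketch: byte-wise logical shifts as byte permutes. The second operand is
// zero, so selector digits 4..7 supply zero bytes.
__device__ unsigned int shl_byte (unsigned int x) { return __byte_perm (x, 0, 0x2104); } // x << 8
__device__ unsigned int shr_byte (unsigned int x) { return __byte_perm (x, 0, 0x4321); } // x >> 8[/codebox]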
The compiler implements idiom recognition for the common rotate idiom you show, and on sm_35 and later architectures the rotate idiom should translate into a single funnel shift (SHF) instruction, which likely means that using PRMT won’t lead to any further speedups. Last I checked the compiler was correctly applying that transformation.
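For completeness, the recognized idiom looks like the sketch below; on sm_35 and later it should compile to a single SHF, and where available the __funnelshift_l() intrinsic expresses the same rotate directly.
[codebox]// Sketch: the common rotate-left idiom the compiler recognizes.
__device__ unsigned int rotl32 (unsigned int x, unsigned int n)
{
    return (x << n) | (x >> (32 - n));   // assumes n in 1..31 here
    // roughly equivalent, sm_32 and later: __funnelshift_l (x, x, n);
}[/codebox]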
Based on my work on the emulation code for the SIMD-in-a-word instructions I would hazard a guess that the most efficient code sequences could well differ from architecture to architecture, as raw function unit throughput is just one aspect. Overall execution pipe utilization in code context is another aspect. I usually experiment with multiple alternative implementation styles to find the most appropriate one for each architecture, but I realize this takes time that may not always be available.