Bytes manipulation in PTX


I was searching in the PTX manual for byte manipulation functions, but I didn’t find what I was looking for.

Basically I have 32-bit words that represents four bytes and I would like to read/write bytes individually. I was thinking about using bfi and bfe. I’m surprised that I didn’t find any specialised instruction. Is this the best way to pack the bytes or I’m missing something? Maybe cvt.pack? but it takes only two inputs.

Currently I’m using bit masks, but the performance could be better I’m sure.

Thanks a lot in advance,



permute may be of interest

shift may be of interest.

1 Like

I would like to read/write bytes individually.

In the vast majority of cases, this not a good idea if you are after high performance. If you use the CUDA profiler (which I heartily recommend), you will likely find that the cost of extracting bytes from a uint32_t via mask & shift is negligible and not the bottleneck in the code.

When combining bytes into a uint32_t it might be marginally better to use addition instead of OR-ing, depending on the architecture (e.g. ISCADD is left-shift plus add in one instruction, while there is usually no left-shift-plus-OR instruction).

In my recollection neither BFI nor BFE have particularly high throughput.

1 Like

Thanks a lot for your answers. I know that this is a bad idea, but the problem has a part based on dynamic-index byte addressing. I had to manually optimise the register use because the optimizer is totally off on that part. Furthermore I decided to pack the bytes into registers so I use less of them.

With for your suggestion of using + instead of or I got a +20% improvement when setting the bytes in uint32_t. So thanks a lot!

I tried bfi, it lead to a 30/40% throughput penalty on setting the bytes. (and increased register usage too)

Oh thanks!, I was looking for permute, I wasn’t able to remember the name. I will try that

My observations: The compiler would not infrequently use PRMT to translate byte-manipulation code based on masks and shifts on older GPU architectures, but does not do so on newer architectures. This suggests that the throughput of PRMT is unattractive on newer architecture (say, Pascal and up).

Giving the overall improvements to code generation in the CUDA compiler through the releases, I find that surprising. What CUDA version are you using, and what architecture are you compiling for? Could you post relevant code? I am curious as to how the compiler falls short. Note that the abstract C++ semantics demand that data with integer types smaller than int is widened to int first before entering into an expression. Optimizers can sometimes work around that under the ‘as-if’ rule (i.e. the generated code must behave as if it follows the abstract semantics), but generally things like packing multiple pieces of narrow data into a register is not something compilers can do.

In my observation, the CUDA compiler has occasional weaknesses when it comes to handling 64-bit integers, but generally generates excellent (and sometimes surprising!) code when manipulating register-sized (32-bit) integers.

1 Like

I must admit that I was kind of trying to defeat the optimiser to test whether my data structures were able to fix the generated assembly and to what extent.

Here’s a reproducer:

#include <array>
#include <cuda.h>

__global__ void byte_stack_array_kernel(volatile uint *ptr, uint init = 1022) {
    constexpr int array_size = 16;
    uint8_t arr[array_size] = {static_cast<unsigned char>(*ptr)};
    for (int c = 0; c < 100; ++c) {
        arr[0] = arr[(c + arr[0]) % array_size];
        arr[1] = arr[c % array_size];
        arr[(init + c) % array_size] = c * init;
        arr[(c + init) % array_size] = arr[1];
    *ptr = arr[*ptr % array_size];

int main() {
    uint *ptr;
    cudaMalloc(&ptr, sizeof(*ptr));
    byte_stack_array_kernel<<<1024, 1024>>>(ptr);

From ptxas I get:

    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 332 bytes cmem[0]

Which is totally expected from such an extreme/not “registerizable” case. I gave a talk a few days ago on optimising SYCL to target the CUDA backend of DPC++. I was able to put this array of bytes into registers and packing four bytes per register.

I end up with no stack-frame usage and only 20 registers used (on clang)! You may find the code here

__global__ void byte_stack_array_kernel(volatile uint *ptr, uint init = 1022) {

Why is it necessary to use volatile here? Using volatile usually necessitates skipping various compiler optimizations. In particular it requires that each access to a data object actually takes place as written, and the data cannot be buffered in a register (as the underlying storage my have changed value between accesses). Outside of specialized code (e.g. interaction with mapped special HW registers), I would not expect to see volatile being used in C or C++ code.

I agree, but trying to generate better PTX won’t hurt ptxas when assembling down to SASS. Furthermore the benchmarks and register/stack-frame are in accordance with the hints given by the ptx in this very case.

No, volatile isn’t necessary at all, anymore. It was used to force an observable behaviour during my tests. Removing it doesn’t change the stack-frame usage and increases the register usage to 33 when targeting sm_75

If you set the size of the array to two, I still get with nvcc some stack-frame usage. I really think that it should be optimised away, even if I use “dynamic indexing”.

I recognise that no one is writing this kind of a kernel in a hot loop. My initial issue is that I had some code that was using in a single place a “dynamic indexing” and It was sending a whole struct to the stack. But everything is solved. I was just trying to optimise the implementation :)

Yes, I figured that out at the end. But it’s not surprising, it’s a very complex transformation and the cost-benefit ratio seems hard to estimate. Anyway, that’s why I was trying to do byte manipulation in the fastest way ;)

It actually might, especially when hand-coding instructions like bfi, bfe, or permute at PTX level. The general problem is one of complexifying code early, rather than late. I haven’t checked recently, but I doubt ptxas can pick apart complex operations in the course of optimization. For PTX layer code, it is usually advisable to stick to simple operations, and let ptxas figure out which complex operations offered by the hardware it can use to achieve a performance advantage on a particular GPU architecture. This might result in ISCADD for one architecture, but IMAD for another one, and PRMT for a third.

Historically, the CUDA toolchain had a few issues with complex operations generated at PTX level. For example, in one code I looked at it emitted a 64-bit bfe (or bfi, I don’t recall which), which at the SASS level turned into a called subroutine as there was no hardware support for this operation. Yikes!

Oops, I wasn’t referring to using inline PTX instructions. It was more general: in some cases, for kernels like the one I sent you, I was able, when adding a layer of indirection in C++ to generate kernels that didn’t use any stack-frame and were basically reduced to a single store. If I recall it was boiling down to 0-initialising an array, and returning a value you accessed with a dynamic index. It was “provable” that you were not accessing out of bounds. But NVCC/PTXAS miss the optimisation (should I considering reporting that as a bug?)

Interesting thread, as I have just been down the exact same path, as part of “scrape the last 0.5% performance improvement”, round of optimisations.

I had not come across using add instead of “or”, so I just gave it a try on a kernel where this operation is a relatively small contributer overall. On sm6.1, using Cuda 10.2, shl and or was the worst, adding was about the same duration-wise, but used less instructions and using three __byte_perms gave a noticable performance improvement and used less instructions again.

Interestingly, on the original shl and or, the compiler did generate the same perm instructions, but added some other xmad instructions, possibly due to there being some operand indexing arithmetic involved.

Again, to put this in perspective, the difference between all was probably well below most sane performance tolerances… :)

EDIT: For clarity, the above was taking a byte from 4 x uint_32 and packing them into a uint_32.

As long as the bit-fields being combined are not overlapping, compilers should be able to make the trade-off between shift-or and shift-add idioms by themselves. But this requires additional analysis, and as far as I recall from playing with the compilers offered by Compiler Explorer only some of the latest ones do this successfully.

On some architectures, like ARMv7, shift-or and shift-add both exist and have the same speed, but many other architectures only have some version of shift-add so substituting by hand is usually something worth trying.

The other processor aspect that plays into this is resource utilization (e.g. execution units in GPUs, or issue ports in x86 CPUs): one choice may be superior to another choice in a specific context by spreading the work across available resources more evenly.

1 Like

So, when using that class, you can’t actually do this:

    arr[(init + c) % array_size] = c * init;

you must do something like

  arr.write((init + c) % array_size, c*init);

Or am I not understanding that there is an operator[] overload there that works for the write path?

Yes, you’re totally right, there’s only a read-only operator[], for two reasons:

  1. When doing byte addressing you cannot return a reference to a sub-part of a word/register, that would be a reference to a temporary. :( You wouldn’t be able to write to it. It’s the vector<bool> issue.

  2. When doing word addressing, if you store through a reference, the array ends up on the stack-frame. The aim of the whole thing is to keep the array in registers, but as there’s no way to address the registers, the reference will have to be one to the stack-frame, thus defeating the purpose. When you return a const reference and use the value right away, LLVM is able to see through and keep the array in registers. But if you bind the return value to a const &, it will very probably be sent to the stack.

I will probably remove all returns of const &. I haven’t tried looking at the generated code by nvccas I was targeting LLVM. But the craziest part of that runtime_index_wrapper is that I get performance improvements on x86. I will probably write a blog post on these things

I’m really impressed by the fact that + performs so much better than or. Do you know why it’s happening? I would expect the computing for or to have less latencies as there’s no carry. But the SM were probably designed with scalar operations in mind.

Do you think x86 could figure that out at the micro-instruction level? Seems pretty hard, but would help.

I am frankly surprised that you observed a largish performance difference from switching ORs to ADDs. I would have expected a small difference only.

You can dump the SASS (machine code) with cuobjdump --dump-sass and compare the variants. On modern processors logical instructions and integer adds are usually handled by the same ALUs with the same latency and throughput and are therefore interchangeable from a performance perspective. I would assume that logical operations require less energy, though.

Older GPU architectures had an ISCADD instruction (integer scale and add), which does a left shift plus an add. It looks like the latest GPU architectures (Ampere and Turing) may have eliminate this instructon. But newer GPU architectures support an LEA instruction (intended to speed up 64-bit addressing computations) which can be used in much the same way, and some GPU architectures have fast 32-bit IMAD (integer multiply and add) that can also be used, but is energetically more expensive. Newer GPU architectures make maximum use of 3-input operations, in particular IADD3 (three-input add, presumably implemented as a carry-save adder bolted to a carry-propagate adder) and LOP3 (implements any logical operation of three operands; conceptually a lookup table, but not necessarily implemented that way). On the latest architectures these have probably the same throughput as plain IADD and LOP. The throughput of the PRMT unit likely varies by GPU architecture and is likely lower than simple ALU operation because it is not frequently needed.

I would expect any efficient byte gathering operations expressed in CUDA code at HLL level to be mapped to a mix of some of the operations enumerated above.

x86 has limited scale-and-add in the form of the LEA instruction, but obviously the available scale factors (1x, 2x, 4x, 8x) are not directly sufficient for byte manipulation. PowerPC processors have a versatile rlwimi (rotate left word immediate plus masked insert) which the compilers do an excellent job of utilizing for byte insert and extract operations.

The last x86 processor architecture I helped create was the AMD Athlon processor (so 20+ years ago). While it used an x86 instruction to internal instruction set translation mechanism, there weren’t any clever optimization that combined internal operations across x86 instruction boundaries. At least not that I recall. I don’t know what modern x86 processors can do.

The “Dissecting Turing” paper has the latencies on page 40: and the throughputs for most are in the Guide: Programming Guide :: CUDA Toolkit Documentation

On Pascal, testing I did using a shared memory bandwidth limited inner loop, showed no difference between BFE, SHL and PRMT in speed.

Later: Looking at page 53 → of “Dissecting”, is also instructive in showing the number of instructions retired after Pascal. There are obviously many new ones that have been added. It’s unfortunate there is no official documention on their functions.

I don’t see that many instructions that have been added in that list. And some that look like they have been added like BREV (bit reverse) and VABSDIFF4 (byte-wise [sum?] absolute differences) were in existence on older architectures like Kepler.

I am surprised to see that both BFI and BFE seem to be gone in Volta and Turing (at least no opcode is listed for these). Presumably replaced by emulation built from SHF (shift via funnel shifter) and/or LEA?

Simplifying the set of available machine instructions should actually be helpful for the compiler and ptxas in particular.