Pipeline operand forwarding for integer instructions in CUDA

Hello there. I am trying to implement a bit-level shuffler in CUDA. What I wrote is something like this (X and Y are uint32's):

  Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 16);
  Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 8);
  Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 4);
  Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 2);
  Y = ((X >> 1) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 1);
  Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 2);
  Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 4);
  Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 8);
  Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 16);

DUMMY_SHUFFLE_CONST is a constant that I have already read from global memory. Each pair of lines is one shuffle stage, e.g.:

  Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 16);
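
For reference, each such pair is the classic “delta swap”: it swaps bit i with bit i+shift exactly where the mask has bit i set. A minimal host-side sketch of that identity (the helper name delta_swap is my own):

#include <cassert>

// Swaps the bit pairs (i, i+shift) of x wherever mask has bit i set.
unsigned int delta_swap(unsigned int x, unsigned int mask, int shift) {
    unsigned int y = ((x >> shift) ^ x) & mask; // XOR of each selected bit pair
    return (x ^ y) ^ (y << shift);              // toggle both bits of each selected pair
}

int main() {
    // Swap bits 0 and 1 of 0b10: mask selects bit 0, shift is 1.
    assert(delta_swap(0x2u, 0x1u, 1) == 0x1u);
    return 0;
}

The nine stages above are then delta_swap(X, DUMMY_SHUFFLE_CONST, s) for s = 16, 8, 4, 2, 1, 2, 4, 8, 16.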

Theoretically, on a superscalar CPU the above 2 lines take 5 cycles to finish: they comprise 6 integer instructions, of which 2 (the X ^ Y and the Y << 16 in the second line) can execute in parallel.
However, when I profile it on the GPU, I get something like 60 cycles, which is 12x larger.

I wonder if this is because CUDA has no pipeline forwarding, which forces stalls between data-dependent instructions?

GPUs are optimized for throughput, not latency.

Each thread in the GPU can handle one integer operation per cycle, i.e. per-thread execution is scalar, as in a low-end ARM processor. The per-thread latency here is something like 6 cycles per stage, times the 9 stages shown above, so the total is close to the 60 cycles you are measuring.

The way to use a GPU “properly” is to have on the order of tens of thousands of threads running concurrently and independently of each other, where each thread performs this computation. That allows us to achieve 10x the throughput of a CPU. A classical example of such massively parallel integer-intensive processing would be various flavors of crypto-currency mining.

If your use case does not offer massive parallelism, or is latency constrained (e.g. high frequency trading), a GPU is probably not the right processor for the task at hand.

True stalls are relatively rare on GPUs. As soon as a thread is not making forward progress (e.g. it is waiting for load data, or waiting for the result of a multi-cycle operation such as an integer multiply) it gets swapped out and another thread gets to run instead. The mental model is that such context switching has zero overhead.

The profiler shows the elapsed cycles for the whole kernel. How did you determine that the shown portion of code takes 60 cycles? Can you show your benchmark program?
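
For anyone who wants to time a short in-kernel sequence directly, a minimal sketch using the clock64() cycle counter (the kernel name, the cycles output array, and the timed span are my own choices):

__global__ void timing_kernel(const unsigned int DUMMY_SHUFFLE_CONST,
                              unsigned int* x, long long* cycles) {
    unsigned int X = x[threadIdx.x];
    long long start = clock64();  // per-SM cycle counter before the sequence
    unsigned int Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 16);
    long long stop = clock64();   // per-SM cycle counter after the sequence
    x[threadIdx.x] = X;           // keep the result live so it is not optimized away
    cycles[threadIdx.x] = stop - start;
}

Note that clock64() includes its own overhead and the compiler may move instructions across it, so such numbers are only indicative.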

When I look at the SASS code for this kernel on Compiler Explorer,

__global__ void kernel(const unsigned int DUMMY_SHUFFLE_CONST, unsigned int* x, unsigned int* y) {
    unsigned int X = x[threadIdx.x];
    unsigned int Y = y[threadIdx.x];
    Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 16);
    Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 8);
    Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 4);
    Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 2);
    Y = ((X >> 1) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 1);
    Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 2);
    Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 4);
    Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 8);
    Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 16);
    x[threadIdx.x] = X;
    y[threadIdx.x] = Y;
}

each pair of lines is implemented in four instructions (1 SHF, 1 IMAD used as a shift, 2 LOP3). It appears to me that the final result is either independent of X or independent of Y, because the kernel only loads one input value (I cannot tell whether it loads X or Y).

@njuffa Thanks for your note on the throughput nature of GPUs. But my application cares about both throughput and latency; it runs millions of iterations like this.
@striker159 Thanks for your in-depth help! You are right that only X is the input and Y is just a temporary variable.
I wrote a kernel like this:

__global__ void kernel(const unsigned int DUMMY_SHUFFLE_CONST, unsigned int* x, unsigned int* y) {
    unsigned int X = x[threadIdx.x], Y;  // Y is just a temporary
    for(int i = 0; i < 1000; ++i) {
        Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 16);
        Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 8);
        Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 4);
        Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 2);
        Y = ((X >> 1) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 1);
        Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 2);
        Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 4);
        Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 8);
        Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 16);
    }
    x[threadIdx.x] = X;
}

Then I measured the number of cycles in Nsight Compute: it shows 596,683 cycles. Dividing that by the iteration count of 1000 gives about 600 cycles for all 9 stages of the shuffle, so each stage (i.e., each pair of lines) takes about 66 cycles.

@striker159 Please see my comment above. I edited the comment to include the @ mention and don’t know whether you got a reply notification. Thanks again!

Without knowing the launch configuration for the kernel and the GPU used, the number of cycles reported by Nsight Compute is kind of meaningless. I would suggest posting a minimal complete reproducer, i.e. a program that can be cut & pasted, compiled & run by 3rd parties and that reproduces your observations.

You are right. I will try to make a minimal compilable example and post my full configuration here later. Thanks!

I set up a simple benchmark and executed it on an RTX 4090 (sm_89).

__global__ void kernel(const unsigned int DUMMY_SHUFFLE_CONST, unsigned int* x) {
    const size_t tid = size_t(threadIdx.x) + size_t(blockIdx.x) * size_t(blockDim.x);
    unsigned int X = x[tid];
    for(int i = 0; i < 1000; ++i) {
        unsigned int Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 16);
        Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 8);
        Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 4);
        Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 2);
        Y = ((X >> 1) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 1);
        Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 2);
        Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 4);
        Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 8);
        Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
        X = (X ^ Y) ^ (Y << 16);
    }
    x[tid] = X;
}

int main(){
    const size_t N = 512*1024*1024;
    
    unsigned int* x;
    cudaMalloc(&x, sizeof(unsigned int) * N);
    kernel<<<1024*1024, 512>>>(0xABBADABA, x);
    cudaDeviceSynchronize();
}

The profiler reports a kernel duration of 826.7 ms, and 1,847,663,273 cycles.

Here is the profiler report exported as an image.

I just made a small example. The launch configuration is just 1 block and 1024 threads for simplicity. I use nvcc V12.4.131 and my GPU is an A100-SXM4-80GB.

To compile, just nvcc kerneltest.cu -o kerneltest -O2

#include <iostream>
#include <cassert>

__global__ void kernel(const unsigned int DUMMY_SHUFFLE_CONST, unsigned int* x) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int X = x[index], Y;
  for(int i = 0; i < 1000; ++i) {
    Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 16);
    Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 8);
    Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 4);
    Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 2);
    Y = ((X >> 1) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 1);
    Y = ((X >> 2) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 2);
    Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 4);
    Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 8);
    Y = ((X >> 16) ^ X) & DUMMY_SHUFFLE_CONST;
    X = (X ^ Y) ^ (Y << 16);
  }
  x[index] = X;
}

int main() {
  unsigned int *x;
  assert(cudaMalloc(&x, 1024 * sizeof(unsigned int)) == cudaSuccess); // 1024 threads each read/write one uint
  kernel<<<1, 1024>>>(0xfacebabe, x);
  assert(cudaDeviceSynchronize() == cudaSuccess);
  return 0;
}

Nsight Compute shows 456,697 cycles.

When making this example, I found that if I reduce the number of threads in a block to 128, the number of cycles drops to ~200,000. So I now suspect it is because my block size is too large.

This probably means that by limiting the block size, we can get better latency.

With a single-warp block <<<(1,1,1), (32,1,1)>>>, this yields 168,988 cycles, which means about 18.777 cycles per line pair.

And that means that if the CUDA core worked like a superscalar processor, with 5 instructions on the critical path, each instruction would take about 3.75 cycles. Does that mean CUDA indeed has no pipeline forwarding?

An RTX 4090 has a throughput of about 4.1e13 simple integer operations per second. Off the top of my head, I do not know whether shifts have the same throughput as ALU operations. Let’s assume they do for now.

The kernel is therefore equivalent to the execution of 3.4e13 simple integer operations. The loop body is therefore (dividing by 512 * 1024 * 1024 threads * 1000 iterations) equivalent to executing 63.3 simple (single-cycle, full-throughput) integer operations.

I would call that a plausible result given the source for the loop body. Now, @striker159 remarked that per the disassembly of a release build of the kernel, there are only about 40 instructions in the loop body. To explain the discrepancy, use the profiler to find inefficiencies (maybe register bank conflicts?) and double-check the documentation to make sure that all instructions that occur in the disassembly of the loop body are in fact full-throughput integer instructions.

In other words, if this is a quick experiment to assess what kind of performance you can expect for your use case, you are at most off by a factor of 1.5x. Not bad for five minutes of work.

Pipeline forwarding means the output of the execution pipe is delivered directly to the next dependent instruction’s input to the pipe, avoiding the latency of writing the output back to the register file and reading the value out of the register file again. The saving could be as small as 1 cycle (no delay) if the execution pipeline is only 1 cycle deep, or forwarding could simply reduce dependent latency by a few cycles.

The code has a chain of 4 dependent instructions:

1 ALU   SHF.R.U32.HI R5, RZ, 0x4, R4                   // R5 = X >> 4
2 ALU   LOP3.LUT R5, R5, c[0x0][0x160], R4, 0x48, !PT  // Y = ((X >> 4) ^ X) & DUMMY_SHUFFLE_CONST
3 FMA   IMAD.SHL.U32 R6, R5, 0x10, RZ                  // R6 = Y << 4 (shift done as multiply by 16)
4 ALU   LOP3.LUT R6, R6, R5, R4, 0x96, !PT             // X = (Y << 4) ^ Y ^ X

3 instructions are executed by the ALU pipe.
1 instruction is executed by the FMA pipe.

The ALU pipe has a maximum throughput of 0.5 warp instructions/cycle per SMSP.
The FMA (IMAD) pipe has a maximum throughput of 0.5 warp instructions/cycle per SMSP.

The test by @gzz_2000 can be slightly modified to perform optimal launches of 1, 2, 3, … 8 warps per SMSP.

int main()
{
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);

    unsigned int* x;
    cudaMalloc(&x, props.multiProcessorCount * 1024 * sizeof(unsigned int));
    
    // 128 threads per block = 1 warp on each of the 4 SMSPs of an SM
    kernel<<<props.multiProcessorCount, 128 * 1 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 2 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 3 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 4 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 5 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 6 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 7 >>> (0xfacebabe, x);
    kernel<<<props.multiProcessorCount, 128 * 8 >>> (0xfacebabe, x);
    
    cudaDeviceSynchronize();
    return 0;
}

The 1-warp-per-SMSP launch will show the full latency introduced by dependent instructions. As the number of warps per SMSP increases, the SM becomes limited by the instruction mix (ALU pipe).

ALU pipe can issue 0.5 instructions/cycle per SMSP.
FMA pipe can issue 0.5 instructions/cycle per SMSP.

Given the instruction mix below, the maximum sustained throughput is 4 instructions per 6 cycles per SMSP.

1 ALU   SHF.R.U32.HI R5, RZ, 0x4, R4
2 ALU   LOP3.LUT R5, R5, c[0x0][0x160], R4, 0x48, !PT
3 FMA   IMAD.SHL.U32 R6, R5, 0x10, RZ
4 ALU   LOP3.LUT R6, R6, R5, R4, 0x96, !PT

    time in cycles -->
    1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
ALU A - A - A - A - A - A - A - A - A -
FMA       F -         F -         F -
A/F instruction issued
-   pipe is busy

Comparing 1-8 warps per SMSP it is possible to determine when dependent latency is hidden by additional warps and when ALU pipe is saturated.

ID  Grid Size   Block Size   Compute Throughput (%)  ALU (%)  IPC
0   46, 1, 1     128, 1, 1   33.10                   33.55    0.90
1   46, 1, 1     256, 1, 1   64.44                   65.74    1.76
2   46, 1, 1     384, 1, 1   90.51                   93.48    2.50
3   46, 1, 1     512, 1, 1   93.58                   94.67    2.54
4   46, 1, 1     640, 1, 1   96.53                   97.48    2.61
5   46, 1, 1     768, 1, 1   97.32                   98.17    2.63
6   46, 1, 1     896, 1, 1   97.72                   98.41    2.64
7   46, 1, 1    1024, 1, 1   97.83                   98.62    2.64

The metrics used for ALU and IPC are:

ALU sm__inst_executed_pipe_alu.avg.pct_of_peak_sustained_active
IPC sm__inst_executed.avg.pct_of_peak_sustained_active

The maximum IPC is 2.64. The SM maximum IPC for CC 7.0 - 9.0 is 4.0, and 2.64 / 4.0 * 100 = 66%, which is the value predicted by the ALU and FMA instruction mix.

Dependent latency can be analyzed by reviewing the metric smsp__average_warps_issue_stalled_wait_per_issue_active.ratio, which is 3.35 cycles for all launch dimensions.

Launches with grid ID > 1 hide the 3.35 cycles of per-instruction latency through additional FMA instructions (and startup instructions) until the ALU pipe is saturated. Saturation occurs at approximately 4 warps per SMSP.

Thanks @Greg for this detailed analysis. Does this mean the GPU has no pipeline forwarding, and that my best practice is to stack more parallelism by adding warps, in order to fill the pipeline with independent warp instructions?

I think there is a typo: it should be per SMSP. I am mentioning it so that anybody trying to follow along does not get stuck on that paragraph of this amazing analysis.

Approach 1
You could try to reduce the latency of each loop iteration at the cost of more parallel work.

As the algorithm is meant to be processed serially, that will probably give you only a small improvement, if any.

You have the lines (and similar with other shift amounts)

   X = (X ^ Y) ^ (Y << 16);
   Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
   X = (X ^ Y) ^ (Y << 8);

The second line needs X itself and X shifted right by 8.

The third line needs Y itself and Y shifted left by 8.

You can incorporate the shifts into the preceding formulas and calculate both in parallel, e.g. for the first two lines:

   X0 = (X ^ Y) ^ (Y << 16);
   X8 = ((X >> 8) ^ (Y >> 8)) ^ ((Y & 0x0000FFFF) << 8);
   Y = (X8 ^ X0) & DUMMY_SHUFFLE_CONST;

X0 and X8 could be calculated by different threads of a warp and exchanged by shuffle instructions. The shifts within the expression for X8 could be moved further up.
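
A minimal sketch of that pairing, assuming all 32 lanes of the warp are active and with the helper name stage_pairwise being my own:

__device__ unsigned int stage_pairwise(unsigned int X, unsigned int Y, unsigned int C) {
    const int lane = threadIdx.x & 31;
    // Even lane evaluates X0, odd lane evaluates X8 (formulas from above).
    unsigned int mine = (lane & 1)
        ? ((X >> 8) ^ (Y >> 8)) ^ ((Y & 0x0000FFFFu) << 8)  // X8
        : (X ^ Y) ^ (Y << 16);                              // X0
    unsigned int other = __shfl_xor_sync(0xFFFFFFFFu, mine, 1); // exchange with partner lane
    return (mine ^ other) & C;  // next Y = (X8 ^ X0) & C, now on both lanes
}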

However, the algorithm is meant to be difficult to optimize in that way. Also, compared to the arithmetic instructions, the warp shuffle instruction is slow (I am not sure about its exact latency).

Approach 2

As you only have bit operations (SHIFT, XOR, AND), you could represent each initial bit of x (and of y and DUMMY_SHUFFLE_CONST) as 32 (or 96) mathematical (logical) variables and symbolically solve one loop iteration. Perhaps you find a way for (optionally parallel) processing with less latency. Perhaps in the end each bit could be calculated just by 32-bit multiplications (multiplications combine shifts with additions, and addition acts as XOR on the LSB)? Or perhaps you get the sum of 32 multiplications where the first operand is just one bit, i.e. effectively the sum of 32 shifted numbers. Here the Tensor Cores with 1-bit inputs and XOR mode could be an option.
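
Since every operation here (shift, XOR, AND with a constant) is linear over GF(2), one concrete way to set up this approach is to push the 32 basis vectors through one full iteration and record, for each output bit, which input bits feed it. A minimal host-side sketch, assuming (as in the example code) a single constant for all nine stages:

// One full iteration of the nine shuffle stages.
unsigned int network(unsigned int X, unsigned int C) {
    const int shifts[9] = {16, 8, 4, 2, 1, 2, 4, 8, 16};
    for (int k = 0; k < 9; ++k) {
        unsigned int Y = ((X >> shifts[k]) ^ X) & C;
        X = (X ^ Y) ^ (Y << shifts[k]);
    }
    return X;
}

// masks[j] collects the input bits that feed output bit j; output bit j of
// one iteration is then the parity of (X & masks[j]). For a pure bit
// permutation each mask ends up with exactly one bit set.
void build_masks(unsigned int masks[32], unsigned int C) {
    for (int j = 0; j < 32; ++j) masks[j] = 0;
    for (int i = 0; i < 32; ++i) {
        unsigned int out = network(1u << i, C);  // image of basis vector e_i
        for (int j = 0; j < 32; ++j)
            if (out & (1u << j)) masks[j] |= 1u << i;
    }
}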

This could work better than approach 1.

Approach 3

Another possibility: the program would probably run (only?) on the order of 100,000x slower for the pre-computation of all 2^32 possible values of x than for the computation of a single value. That may or may not be acceptable in your case. The output array would be 16 GiB in size. This is irrespective of the number of loop iterations. In your example it would be about a minute of running time.

If it is about encryption or signing, that would be the time to break the encryption.

Thanks! I fixed it in the original response.

To my knowledge, NVIDIA has not documented the SM micro-architecture at the level of detail necessary to answer your question.

One of the most basic forms of optimization is to increase the latency hiding through one of two different mechanisms:

  1. warp occupancy
  2. instruction level parallelism

The modification I made to the kernel is an example of using warp occupancy. As can be seen in the limited results provided, the benefit diminishes at 4-5 warps/SMSP, as the ALU pipe becomes fully saturated.

The other option would be to use instruction level parallelism. There are two approaches. The easiest is loop unrolling, if there is no dependency between loop iterations. The more complicated method is to manually interleave duplicate operations, as sketched below.
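
A minimal sketch of the interleaving idea, assuming the application can supply two independent inputs per thread (only the first of the nine stages is shown; the X0/X1 naming is mine):

__global__ void kernel2(const unsigned int DUMMY_SHUFFLE_CONST, unsigned int* x) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int X0 = x[2 * tid];      // two independent dependency chains per thread
    unsigned int X1 = x[2 * tid + 1];
    for (int i = 0; i < 1000; ++i) {
        // The two chains have no mutual dependency, so the scheduler can
        // issue one while the other waits on its previous result.
        unsigned int Y0 = ((X0 >> 16) ^ X0) & DUMMY_SHUFFLE_CONST;
        unsigned int Y1 = ((X1 >> 16) ^ X1) & DUMMY_SHUFFLE_CONST;
        X0 = (X0 ^ Y0) ^ (Y0 << 16);
        X1 = (X1 ^ Y1) ^ (Y1 << 16);
        // ... the remaining eight stages, interleaved the same way ...
    }
    x[2 * tid] = X0;
    x[2 * tid + 1] = X1;
}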

The first step of optimization is complete when you hit a throughput limiter. In the case above, the limiter is the ALU pipe’s instruction throughput. The next optimization would be to determine whether any of the operations could be moved to a different pipe, such as using additional FMA instructions, or replaced by a lookup table.

I think with Approach 2 it can be sped up by

  • Calculating once a 32-bit AND mask for each of the 32 target bits of X from the DUMMY_SHUFFLE_CONST => one mask for each thread.

And then for each iteration, a warp of 32 threads cooperates, with each thread responsible for one bit. In the loop (a sketch follows the list):

  • use the current X
  • AND with the respective mask
  • do a population count and
  • take the LSB
  • combine the bits of the 32 threads with __ballot_sync to calculate the next X.
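
A minimal device-side sketch of this loop, assuming a single warp cooperates on one 32-bit word and that masks holds the precomputed per-bit AND masks (the names are mine):

__global__ void permute_ballot(unsigned int* x, const unsigned int* masks, int iters) {
    const unsigned int lane = threadIdx.x & 31;  // one lane per output bit
    const unsigned int mask = masks[lane];       // precomputed AND mask for this bit
    unsigned int X = x[0];                       // all 32 lanes hold the same word
    for (int i = 0; i < iters; ++i) {
        unsigned int bit = __popc(X & mask) & 1u;  // parity of the selected input bits
        X = __ballot_sync(0xFFFFFFFFu, bit);       // lane j's bit becomes bit j of X
    }
    if (lane == 0) x[0] = X;
}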

I would expect a latency speed-up by a factor of 4 compared to the current version.

But 32 threads would be needed instead of 1 thread.

However, I think this can be further improved by combining several iterations of the loop!
So the algorithm does not seem to be very secure.

Forwarding just means that the result of an instruction is available to a dependent instruction without requiring the data to take a round-trip through the register file. It is safe to assume that all modern high-performance processors perform forwarding.

But forwarding cannot eliminate data dependencies, and it cannot eliminate limitations of an in-order scalar integer processing pipeline. So I am not sure how it is germane to the performance of the code presented here.

To my knowledge, ILP in GPUs is currently limited to floating-point operations and a mix of floating-point and integer operations, but does not apply to pure integer code, which is the case here. The principal means of parallel execution in a GPU is thread parallelism.

NVIDIA folks more intimately familiar with the latest GPU architectures are invited to correct me should my knowledge be outdated; I don’t want to mislead.

After consulting the latest CUDA Programming Guide, I think I need to correct the back-of-the-envelope computation for the RTX 4090 that I performed above. As opposed to compute capability 7.x and 8.0, which provide equal amounts of FP32 and INT32 execution units, there is a 2:1 ratio of FP32:INT32 units in compute capability 8.9. This means I overstated the throughput of INT32 operations on the RTX 4090 by a factor of 2, i.e. I should have started with 2.05e13 simple integer instructions / second, leading to an estimate of 32 instructions for the loop body. That is a bit lower than what is observed (>= 9*4 = 36), but the difference may be due to dynamic clock boosting above the nominal frequency skewing the result.

Thanks all for your brainstorming! I will think about your approaches carefully later. I can provide some background on what my application is doing and why it requires such bit manipulation. If you are not interested, you can skip this.

It is actually not encryption, but just a kernel performing a global bit permutation across many 32-bit integers using a Benes network.

For instance,

  Y = ((X >> 8) ^ X) & DUMMY_SHUFFLE_CONST;
  X = (X ^ Y) ^ (Y << 8);

just looks, in parallel, at every pair of bits in X at distance (stride) 8 and optionally swaps them. If the corresponding bit in DUMMY_SHUFFLE_CONST is 1, the XOR result in Y is kept and then used to toggle both bits (effectively swapping them).

As a result, DUMMY_SHUFFLE_CONST will differ across threads and across stages when I use this kernel in practice. I might use some software pipelining to read the constants from global memory.

My performance requirement lies in both latency and throughput, with latency being the bigger concern. My application works like this: run a permutation, do something with the permuted bits, run the permutation again, and so on. Each permutation is just a short kernel running in <1000 cycles. The faster the better, as there are a lot of iterations. (In practice, I think I will use a cooperative kernel instead of separate kernels to save the launch overhead.)
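
For reference, a minimal sketch of that cooperative-kernel pattern, with the per-iteration work left as placeholders (the kernel name and launch details are my own; requires compiling with -rdc=true):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void iterate(unsigned int* x, int iters) {
    cg::grid_group grid = cg::this_grid();
    for (int i = 0; i < iters; ++i) {
        // ... run one permutation on this thread's data ...
        grid.sync();  // grid-wide barrier instead of a kernel-launch boundary
        // ... do something with the permuted bits ...
        grid.sync();
    }
}

// Must be launched with cudaLaunchCooperativeKernel, and the whole grid has
// to be co-resident on the GPU for grid.sync() to be valid:
//   void* args[] = { &x, &iters };
//   cudaLaunchCooperativeKernel((void*)iterate, gridDim, blockDim, args);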

Thanks again to all of you! This is really a helpful community.