Unrolling makes performance worse

I wrote a simple code: compute the addition of two vectors.
Following Chapter 4 in the book Professional CUDA C Programming, I’m trying unrolling to get better performance.


full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/temp.cu

base code:

__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) C[idx] = A[idx] + B[idx];


sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nElem);

x2 unrolling:

__global__ void sumArraysOnDeviceUnroll2(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    if (idx + blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];


sumArraysOnDeviceUnroll2<<<grid.x / 2, block>>>(d_A, d_B, d_C, nElem);

x4 unrolling:

__global__ void sumArraysOnDeviceUnroll4(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    if (idx + 3 * blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];
        C[idx + 2 * blockDim.x] = A[idx + 2 * blockDim.x] + B[idx + 2 * blockDim.x];
        C[idx + 3 * blockDim.x] = A[idx + 3 * blockDim.x] + B[idx + 3 * blockDim.x];


sumArraysOnDeviceUnroll4<<<grid.x / 4, block>>>(d_A, d_B, d_C, nElem);


I observed no performance gain after unrolling, and even some performance drop sometimes…

block.x = 1024:

no unrolling <<< 262144, 1024 >>> elapsed 8.922000 ms
unroll2      <<< 131072, 1024 >>> elapsed 8.960000 ms
unroll4      <<< 65536, 1024 >>> elapsed 8.990000 ms

block.x = 512:

no unrolling <<< 524288,  512 >>> elapsed 8.934000 ms
unroll2      <<< 262144,  512 >>> elapsed 8.923000 ms
unroll4      <<< 131072,  512 >>> elapsed 8.963000 ms

block.x = 256:

no unrolling <<< 1048576,  256 >>> elapsed 8.881000 ms
unroll2      <<< 524288,  256 >>> elapsed 8.906000 ms
unroll4      <<< 262144,  256 >>> elapsed 8.932000 ms

block.x = 128:

no unrolling <<< 2097152,  128 >>> elapsed 8.911000 ms
unroll2      <<< 1048576,  128 >>> elapsed 8.944000 ms
unroll4      <<< 524288,  128 >>> elapsed 8.978000 ms

More info

Hardware: I’m using GTX 1080 Ti GPU

(1) I don’t see a loop in the original code you posted above, therefore there is no unrolling here.
(2) The original code is limited by memory bandwidth, trying to “save instructions” isn’t going to do any good.
(3) For memory-bound code, access patterns can be crucial to achieve the highest possible memory throughput. The original code already uses the most suitable access pattern.

How to find out that a kernel is memory bound, and which memory access patterns are most advantageous is all covered in the copious documentation that comes with CUDA. For performance work, you would want to become familiar with the CUDA profiler and study the Best Practices Guide in particular.

Thanks a lot for the reply.

Sorry about the wrong statement, it should be “unrolling” instead of “unrolling loops”.

Actually I’m following Chapter 4, Section “Performance Tuning” in the book Professional CUDA C Programming,

where thers is a performance gain.

I noticed that 1080 Ti is based on Pascal architecture, while the book is based on Fermi and Kepler. I wonder what difference between the architectures causes this different result on this experiment.

I have never seen the term unrolling being applied to anything other than loops (which are the only “rolled” form of control transfer) and I have been involved with software engineering for 30 years now.

I am not familiar with the book “Professional CUDA C Programming” and therefore can’t judge its accuracy. But I note that from what I have seen in terms of questions in these forums, it seems to be the source of more confusion and misconceptions than any other CUDA publication. Proceed with caution if you intend to continue to work with this book. Maybe the technique the authors suggest and that you tried worked in limited circumstances at some point; I am not going to read the book to find out.

A general issue with older books on CUDA is that some information in the books no longer applies to modern GPU architectures, say Maxwell and later. The memory sub-system in particular is much more sophisticated in newer GPUs compared to older architectures. The still popular book “CUDA by Example” suffers most from reference to out-of-date concepts, as it was published ten years ago when the Fermi architecture was new.

I often recommend “The CUDA Handbook” by Nicholas Wilt for its concise and pretty much error-free writing. Full disclosure: I served as a reviewer for some chapters of this book before it came out and my opinion may therefore be biased. And it has been five years or so since this book was published, so it may also show signs of aging by now.

A more versatile canonical traversal pattern for 1D-arrays in CUDA is:

int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
    dst[i] = ...  src[i];

Unless there is some very expensive processing involved to generate ‘dst’ from ‘src’, such code will be memory bound. By all means use the CUDA profiler to confirm instead of relying on handwaving.

The process of adding more work-items per thread can improve performance by

  1. reducing registers per work-item as two work-items in the same thread may share common values, or
  2. adding additional compute work between memory accesses, or
  3. potentially eliminating tail effect in a thread block.

A simple vector that is memory throughput limited does not benefit from any of the above optimizations. For this code the SM will be at maximum occupancy, all warps will be stalled on memory loads (long_scoreboard), and the SM math pipelines will be mostly idle.

@njuffa One more question about the code you provided:

int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
    dst[i] = ...  src[i];

It seems that (i + stride) is guaranteed to exceed the maximum index, right? (cause stride = total # threads)
So why write a for loop here if this loop body will always be executed once?


It’s a grid-stride loop, and the assumption is that it may be used in the case where the total number of threads launched is less than the size of the dataset. In that case, the loop will iterate more than once (at least for some threads).


The idiom you used in your code assumes that the number of threads is >= the number of array elements. The idiom I provided accounts for the fact that a kernel may have to handle more array elements than there are threads. In that case this loop is actually going to run through more than one iteration.

Out-of-bounds accesses are prevented by the condition “i < len”, where “len” is the number of elements in the 1D-array that is being traversed, and “i” is the index into the array.