Unrolling makes performance worse

Yiqi · April 17, 2019, 10:54pm

I wrote a simple code: compute the addition of two vectors.
Following Chapter 4 in the book Professional CUDA C Programming, I’m trying unrolling to get better performance.

Code

full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/temp.cu

base code:

__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) C[idx] = A[idx] + B[idx];
}

......

sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nElem);

x2 unrolling:

__global__ void sumArraysOnDeviceUnroll2(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    if (idx + blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];
    }
}

......

sumArraysOnDeviceUnroll2<<<grid.x / 2, block>>>(d_A, d_B, d_C, nElem);

x4 unrolling:

__global__ void sumArraysOnDeviceUnroll4(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    
    if (idx + 3 * blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];
        C[idx + 2 * blockDim.x] = A[idx + 2 * blockDim.x] + B[idx + 2 * blockDim.x];
        C[idx + 3 * blockDim.x] = A[idx + 3 * blockDim.x] + B[idx + 3 * blockDim.x];
    }
}

......

sumArraysOnDeviceUnroll4<<<grid.x / 4, block>>>(d_A, d_B, d_C, nElem);

Results

I observed no performance gain after unrolling, and even some performance drop sometimes…

block.x = 1024:

no unrolling <<< 262144, 1024 >>> elapsed 8.922000 ms
unroll2      <<< 131072, 1024 >>> elapsed 8.960000 ms
unroll4      <<< 65536, 1024 >>> elapsed 8.990000 ms

block.x = 512:

no unrolling <<< 524288,  512 >>> elapsed 8.934000 ms
unroll2      <<< 262144,  512 >>> elapsed 8.923000 ms
unroll4      <<< 131072,  512 >>> elapsed 8.963000 ms

block.x = 256:

no unrolling <<< 1048576,  256 >>> elapsed 8.881000 ms
unroll2      <<< 524288,  256 >>> elapsed 8.906000 ms
unroll4      <<< 262144,  256 >>> elapsed 8.932000 ms

block.x = 128:

no unrolling <<< 2097152,  128 >>> elapsed 8.911000 ms
unroll2      <<< 1048576,  128 >>> elapsed 8.944000 ms
unroll4      <<< 524288,  128 >>> elapsed 8.978000 ms

More info

Hardware: I’m using GTX 1080 Ti GPU

njuffa · April 17, 2019, 11:27pm

(1) I don’t see a loop in the original code you posted above, therefore there is no unrolling here.
(2) The original code is limited by memory bandwidth, trying to “save instructions” isn’t going to do any good.
(3) For memory-bound code, access patterns can be crucial to achieve the highest possible memory throughput. The original code already uses the most suitable access pattern.

How to find out that a kernel is memory bound, and which memory access patterns are most advantageous is all covered in the copious documentation that comes with CUDA. For performance work, you would want to become familiar with the CUDA profiler and study the Best Practices Guide in particular.

Yiqi · April 17, 2019, 11:50pm

Thanks a lot for the reply.

Sorry about the wrong statement, it should be “unrolling” instead of “unrolling loops”.

Actually I’m following Chapter 4, Section “Performance Tuning” in the book Professional CUDA C Programming,
http://www.hds.bme.hu/~fhegedus/C++/Professional%20CUDA%20C%20Programming.pdf

where thers is a performance gain.

I noticed that 1080 Ti is based on Pascal architecture, while the book is based on Fermi and Kepler. I wonder what difference between the architectures causes this different result on this experiment.

njuffa · April 18, 2019, 12:08am

I have never seen the term unrolling being applied to anything other than loops (which are the only “rolled” form of control transfer) and I have been involved with software engineering for 30 years now.

I am not familiar with the book “Professional CUDA C Programming” and therefore can’t judge its accuracy. But I note that from what I have seen in terms of questions in these forums, it seems to be the source of more confusion and misconceptions than any other CUDA publication. Proceed with caution if you intend to continue to work with this book. Maybe the technique the authors suggest and that you tried worked in limited circumstances at some point; I am not going to read the book to find out.

A general issue with older books on CUDA is that some information in the books no longer applies to modern GPU architectures, say Maxwell and later. The memory sub-system in particular is much more sophisticated in newer GPUs compared to older architectures. The still popular book “CUDA by Example” suffers most from reference to out-of-date concepts, as it was published ten years ago when the Fermi architecture was new.

I often recommend “The CUDA Handbook” by Nicholas Wilt for its concise and pretty much error-free writing. Full disclosure: I served as a reviewer for some chapters of this book before it came out and my opinion may therefore be biased. And it has been five years or so since this book was published, so it may also show signs of aging by now.

A more versatile canonical traversal pattern for 1D-arrays in CUDA is:

int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
    dst[i] = ...  src[i];
}

Unless there is some very expensive processing involved to generate ‘dst’ from ‘src’, such code will be memory bound. By all means use the CUDA profiler to confirm instead of relying on handwaving.

Greg · April 18, 2019, 4:43pm

The process of adding more work-items per thread can improve performance by

reducing registers per work-item as two work-items in the same thread may share common values, or
adding additional compute work between memory accesses, or
potentially eliminating tail effect in a thread block.

A simple vector that is memory throughput limited does not benefit from any of the above optimizations. For this code the SM will be at maximum occupancy, all warps will be stalled on memory loads (long_scoreboard), and the SM math pipelines will be mostly idle.

Yiqi · April 18, 2019, 6:12pm

@njuffa One more question about the code you provided:

int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
    dst[i] = ...  src[i];
}

It seems that (i + stride) is guaranteed to exceed the maximum index, right? (cause stride = total # threads)
So why write a for loop here if this loop body will always be executed once?

Thanks~

Robert_Crovella · April 18, 2019, 7:40pm

It’s a grid-stride loop, and the assumption is that it may be used in the case where the total number of threads launched is less than the size of the dataset. In that case, the loop will iterate more than once (at least for some threads).

[url]https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/[/url]

njuffa · April 18, 2019, 7:42pm

The idiom you used in your code assumes that the number of threads is >= the number of array elements. The idiom I provided accounts for the fact that a kernel may have to handle more array elements than there are threads. In that case this loop is actually going to run through more than one iteration.

Out-of-bounds accesses are prevented by the condition “i < len”, where “len” is the number of elements in the 1D-array that is being traversed, and “i” is the index into the array.

Topic		Replies	Views
CUDA compiler bug or user error? CUDA Programming and Performance	28	2464	July 28, 2017
loop unrolling CUDA Programming and Performance	11	16965	January 31, 2008
Different output of code when not unrolling loop CUDA Programming and Performance	16	1059	August 22, 2022
'for' loop performance hacks? CUDA Programming and Performance	17	10409	February 28, 2009
An Even Easier Introduction to CUDA Technical Blog	141	6055	November 28, 2023
Speed improvement CUDA Programming and Performance	18	8267	December 5, 2008
CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops Technical Blog	18	1452	September 12, 2019
Does launching a kernel with just 1 thread with 1 block ensures sequential execution in GPU? CUDA Programming and Performance cuda , kernel	5	932	December 17, 2023
A strange phenomenon on register allocation. How to reduce register pressure? CUDA Programming and Performance cuda	14	1227	March 25, 2022
Performance issues on memory transfer CUDA Programming and Performance	13	12981	November 26, 2010

Unrolling makes performance worse

Code

Results

More info

Related topics