Loop unroll & remainder perf

I have a loop that I want to unroll:

int idx = blockDim.x*blockIdx.x + threadIdx.x;
#pragma unroll (N)
for(auto i = 0; i < N; i++){
    int j = idx*N + i;
    if(j > MAX) break;
    x[j] = 1337; /* do something with x[j] */
}

I need that (j > MAX) check because the work does not necessarily divide by N. I guess it hurts the unroll performance, because now there would be N “check if j > MAX” instructions… An alternative would be:

int idx = blockDim.x*blockIdx.x + threadIdx.x;
if((idx+1)*N-1 > MAX) {
    // let's call this loop LA
    for(auto i = 0; i < N; i++){
        int j = idx*N + i;
        if(j > MAX) break;
        x[j] = 1337; /* do something with x[j] */
    }
} else {
#pragma unroll (N)
    for(auto i = 0; i < N; i++){
        int j = idx*N + i;
        x[j] = 1337; /* do something with x[j] */
    }
}

Now the unrolled loop does not have the pesky condition check, and the last thread, the one that has to handle the remainder, runs the non-unrolled version (LA).
Q1. Which one is better?
Q2. I can still unroll the first loop (LA) too, would it be better?
If needed, let’s assume N is 8 or 16, or 32 at most, whereas MAX is ~100K and the kernel is launched as <<<Q,512>>> where Q is MAX/(N*512).
Or should I just bite the bullet, waste up to 63 bytes (at worst), and pad the x array so that there is no loop remainder?
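For concreteness, the padding option would look roughly like this (sketch only, untested; the round-up arithmetic and host-side names are illustrative):

// Round the allocation up to a full grid's worth of elements so that every
// thread owns a complete chunk of N and no bounds check is needed.
const int threadsPerBlock = 512;
const int elemsPerBlock   = N * threadsPerBlock;
const int Q         = (MAX + elemsPerBlock) / elemsPerBlock;  // ceil((MAX+1)/elemsPerBlock)
const int paddedLen = Q * elemsPerBlock;                      // covers indices 0..MAX plus padding

int *x = nullptr;
cudaMalloc(&x, paddedLen * sizeof(int));

// kernel<<<Q, threadsPerBlock>>>(x);  and inside the kernel:
//     int idx = blockDim.x*blockIdx.x + threadIdx.x;
//     #pragma unroll (N)
//     for (int i = 0; i < N; i++)
//         x[idx*N + i] = 1337;        // no j > MAX check required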

One way to answer these questions would be to do some benchmarking. The performance benefits associated with loop unrolling depend heavily on the contents of the body of the loop. For the most part, you haven’t shown that.

FWIW, I hate to see this:

int j = idx*N + i;
x[j] = 1337;

It just makes me sad, dejected, ready to give up. Ay caramba! It breaks one of the two most important principles of CUDA performance (strive for coalesced access).

If that were my code, I would focus on that first before spending any time on loop unrolling studies.

To @RobertCrovella’s point: For optimal coalesced memory access, the threads in a warp should be accessing memory in a contiguously interleaved pattern, the canonical code for which looks like this:

    int stride = gridDim.x * blockDim.x;              // total number of threads in the grid
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < N; i += stride) {           // N here is the total element count
        x[i] = <expression>;
    }

The kind of memory access pattern shown in the question (per-thread contiguous) can easily cut memory throughput by a factor of 10 compared to fully coalesced access.
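To make the contrast concrete, here is a sketch of the two patterns side by side (kernel names, N_PER_THREAD and total are placeholders, not from the thread; untested):

    // Per-thread contiguous: thread t touches x[t*N_PER_THREAD] .. x[t*N_PER_THREAD+N_PER_THREAD-1]
    __global__ void per_thread_contiguous(int *x, int total, int N_PER_THREAD)
    {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        for (int i = 0; i < N_PER_THREAD; i++) {
            int j = tid * N_PER_THREAD + i;   // neighboring threads are N_PER_THREAD elements apart
            if (j < total) x[j] = 1337;       // poorly coalesced
        }
    }

    // Interleaved: at each step i, neighboring threads touch neighboring elements
    __global__ void interleaved(int *x, int total, int N_PER_THREAD)
    {
        int tid    = blockDim.x * blockIdx.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = 0; i < N_PER_THREAD; i++) {
            int j = i * stride + tid;
            if (j < total) x[j] = 1337;       // fully coalesced
        }
    }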


I see your ‘ay caramba’ & raise you a ‘don’t have a cow, man!’ :) So would you be happy if I rearranged the memory access to ‘int j = Q*i + idx;’? (so that at each step ‘i’ of the loop, threads access sequential memory locations) But that’s really not my point here; I want to know how to handle the loop remainder.
Obviously I’m not reimplementing memset; x[j] = 1337 is a placeholder. It’s something like:

int cli = d_i2c[j];
int vv = x[j];
int index = abs(vv)-1;
bool v0 = d_data[index];
bool v = (x[j]<0) ^ v0;
atomicAdd(d_flgs+cli,1);

Read this as: a few bitwise operations and one atomic operation. This is the meat of the loop. One question is whether loop unrolling is going to help at all, and another is: assuming we want to unroll, which version is better?
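Put together with the branch split from the first post, a complete kernel would look roughly like this (untested sketch; the lambda, template parameter and pointer types are assumptions):

template <int N>
__global__ void flag_kernel(const int *x, const int *d_i2c,
                            const bool *d_data, int *d_flgs, int MAX)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // the loop body from above, factored out so both branches stay in sync
    auto work = [&](int j) {
        int  cli   = d_i2c[j];
        int  vv    = x[j];
        int  index = abs(vv) - 1;
        bool v0    = d_data[index];
        bool v     = (x[j] < 0) ^ v0;   // 'v' presumably feeds the real computation
        (void)v;
        atomicAdd(d_flgs + cli, 1);
    };

    if ((idx + 1) * N - 1 > MAX) {      // remainder thread: keep the bounds check
        for (int i = 0; i < N; i++) {
            int j = idx * N + i;
            if (j > MAX) break;
            work(j);
        }
    } else {                            // full chunk: unrolled, no bounds check
#pragma unroll (N)
        for (int i = 0; i < N; i++)
            work(idx * N + i);
    }
}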

Yes, I understood your question was about loop unrolling. And to the best that I could (not very good) I tried to address that first.

The coalescing observation was secondary. If you are aware of it (seems that you are) great.

I think loop unrolling will help. I don’t have the function prototype, but I’m assuming things like d_i2c and the others are in global memory (little snippets of code frequently lead to these kinds of stumbling blocks when trying to provide useful commentary). The unrolling may help with earlier scheduling of those global loads. I would also try to choose an unroll size in the range you indicated, e.g. 8 or 16, to start with, then “shmoo” around to see if there are any local minima.

I like the approach where you have the fixed-trip-count loop off by itself, unroll that, and use it to cover the majority of cases. That makes sense to me, and I would expect the compiler to get it also. If it were me, I would start there, then incrementally try your other changes.

But that is all hand-wavy stuff. In about the amount of time we’ve spent so far on this, you could probably get more insight by just trying a few cases, doing benchmarking.

I doubt there is general wisdom that correctly guides in these kinds of detailed differences you are proposing. At least, I don’t have it. So if it were me, I would be writing benchmarks.
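As a reference point, such a benchmark can be as small as a cudaEvent timing wrapper, something like this sketch (names are placeholders, error checking omitted):

    template <typename Launch>
    float time_variant(Launch launch, int reps = 100)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        launch();                          // warm-up launch
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int r = 0; r < reps; r++)
            launch();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / reps;                  // average milliseconds per launch
    }

    // usage, e.g.:
    // float t = time_variant([&]{ kernelA<<<Q, 512>>>(x, MAX); });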


Generally speaking, if you want to unroll loops and derive a significant performance benefit from it, it is advantageous to move loop-dependent if-statements out of the loop body first.

The CUDA compiler knows how to fully and partially unroll loops where its heuristics deem it profitable. There is usually no need to do it by hand. The compiler also knows how to deal with trip counts that are not a multiple of the unroll factor, by splitting a loop into an unrolled loop handling trunc(N/unroll_factor) iterations and a fully rolled cleanup loop handling the N - trunc(N/unroll_factor)*unroll_factor remaining iterations. If unrolling is desired where the compiler by default supplies none, the first thing to try is adding a #pragma unroll with the desired unroll factor. Manual unrolling should be a method of last resort.
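As a hand-written illustration of that split (not actual compiler output; unroll factor 4, body is a placeholder):

    __device__ void body(int) { /* placeholder for the real loop body */ }

    __device__ void unrolled_by_4(int n)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {   // unrolled main loop: trunc(n/4) passes
            body(i);
            body(i + 1);
            body(i + 2);
            body(i + 3);
        }
        for (; i < n; i++)            // rolled cleanup loop: n % 4 iterations
            body(i);
    }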

Above all, optimization work should be directed at the bottlenecks identified by the CUDA profiler. From the information that has been presented here, the code is limited by effective memory throughput, and unrolling the loop won’t fix that.

