Divide loop to increase occupancy

I want to process n = 65536 elements divided into 1024 blocks of 64 threads like this, but the achieved occupancy on my Titan V is suboptimal:

__global__ void DoStuff(float* stuff, int n){
    const auto i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    for(auto j = 0; j < n; ++j){                           // every thread loops over all n elements
        auto x = stuff[i] - stuff[j];
    }
}

So I want to try dividing the loop into portions so I can use more blocks, but I can't get my head around how to do it. Could anyone seed a train of thought to get me started?

For example, if I had twice the blocks (2048), I could run the first half of the loop iterations on blocks 0-1023 and the second half on blocks 1024-2047.

It's probably stupid simple, but my brain doesn't want to cooperate.

In your original code, using the variable i both as the thread index and as the loop index does not make a lot of sense; the loop index variable shadows the outer variable i.

The number of loop iterations you run does not affect occupancy. It only affects runtime per block.

So you want twice the number of blocks, each doing half the work? That may improve occupancy, depending on the register use of your kernel. See the next post for some details.

Here's how every second block might iterate over a different half of your loop (assuming n is even):

int which_half = blockIdx.x & 1;   // even blocks take the first half, odd blocks the second
// with twice the blocks, pairs of adjacent blocks must map to the same element
const auto i = (blockIdx.x >> 1) * blockDim.x + threadIdx.x;
for(auto j = n/2 * which_half; j < n/2 * which_half + n/2; ++j){
    // ... loop body using stuff[i] and stuff[j] ...
}
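A minimal usage sketch, assuming the snippet above replaces the loop inside the DoStuff kernel and stuff is a device pointer to n floats (sizes taken from the OP):

DoStuff<<<2 * 1024, 64>>>(stuff, n);   // 2048 blocks; even/odd pairs of blocks cover the same elements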

The way to increase occupancy is to limit the number of registers a kernel function uses (assuming the absence of other limiting factors, such as shared memory use per block).

One way to do so (globally for the .cu module) is to pass the -maxrregcount=N argument to nvcc.
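For example (the register limit of 32 and the file name are purely illustrative):

nvcc --maxrregcount=32 -o stuff stuff.cu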

The other (likely preferable) way is to use the __launch_bounds__() qualifier, which tells the compiler what occupancy to optimize for.

For details see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
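A minimal sketch of what that might look like on the kernel above (the bounds 64 and 16 are illustrative, not tuned values):

__global__ void __launch_bounds__(64, 16) DoStuff(float* stuff, int n){
    // 64 = maximum threads per block, 16 = desired minimum resident blocks per SM
    // ... body as before ...
}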

Whoops, I overlooked the i; it's just a quick sample I typed out.

n is the same as the thread count; each thread processes all n elements, giving 65536*65536 (about 4.3 billion) total loop iterations, and both the i and j indices are used within the loop.

What if I wanted to scale the division beyond two halves, and how would I use the i index in the loop?

What indices do I apply to i and j, as in the revised OP?

__global__ void DoStuff(float* stuff, int n, int nj){
    const auto i  = blockIdx.x * blockDim.x + threadIdx.x;  // element owned by this thread
    const auto j0 = blockIdx.y * blockDim.y + threadIdx.y;  // which slice of the j loop
    const auto chunk = (n + nj - 1) / nj;                   // iterations per slice
    for(auto j1 = 0; j1 < chunk; ++j1){
        auto j = j0 * chunk + j1;
        if (j >= n) break;
        auto x = stuff[i] - stuff[j];
    }
}

I tried using the y dimension, but it doesn't use any more warps; it's as if the grid were executed in rows and the rows were not run in parallel.

You need to launch the kernel in a suitable configuration:

unsigned int blocksize_x = 64;
unsigned int nj = 4;    // pick whichever value works best
dim3 gridDim((n+blocksize_x-1)/blocksize_x, nj, 1);
dim3 blockDim(blocksize_x, 1, 1);
DoStuff<<<gridDim,blockDim>>>(stuff, n, nj);

Disclaimer: written in browser, untested!

I tried 2 and 4 for the y dimension. What's wrong with making the gridDim n/blocksize, as I have always done?

Using n/blocksize for a one-dimensional gridDim is fine as long as n is a multiple of the blocksize.

Huh, actually a y dimension of 2 does utilise more warps and doubles occupancy to almost 100%, but it's slightly slower overall. Even if I use fewer blocks to reduce occupancy slightly, some warps are delayed according to Nsight Compute. It looks like I'm better off ignoring the occupancy situation.

The most efficient approach may be to launch a number of blocks equal to an integer multiple of the number of multiprocessors on the device (but no more than the device can have resident at once due to occupancy limitations), and to use a “grid-stride loop” to divide the work among the available blocks.

This should eliminate the problem of “delayed warps” (or trailing blocks, for that matter).
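A minimal sketch of that approach applied to the kernel from the OP (4 blocks per SM and 64 threads per block are illustrative starting points, not tuned values):

__global__ void DoStuff(float* stuff, int n){
    const auto stride = gridDim.x * blockDim.x;   // total threads in the grid
    // each thread walks i across the elements in grid-sized steps
    for(auto i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride){
        for(auto j = 0; j < n; ++j){
            auto x = stuff[i] - stuff[j];
        }
    }
}

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);   // SM count of device 0
DoStuff<<<4 * numSMs, 64>>>(stuff, n);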

Christian

Tried grid-stride loops.