CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

One of the most common tasks in CUDA programming is to parallelize a loop using a kernel. As an example, let’s use our old friend SAXPY. Here’s the basic sequential implementation, which uses a for loop. To efficiently parallelize this, we need to launch enough threads to fully utilize the GPU. void saxpy(int n, float a,…
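For readers skimming the excerpt above, the pattern the post builds toward looks like this (a sketch of the familiar grid-stride SAXPY kernel, not the post's verbatim listing):

__global__
void saxpy(int n, float a, float *x, float *y)
{
    // Grid-stride loop: each thread processes multiple elements,
    // stepping by the total number of threads in the grid.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}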

Thank you! I use this pattern everywhere now.

Thank you for the article. I always learn a lot from them.

Mark,

When running loops of such large size, do we need to copy arrays from global to shared memory to speed calculations up? I have a B x T array in global memory totaling 1.4 GB in size. I need to take chunks of 1 x T size from it and perform a convolution with another 1 x T array taken from a lookup table 16 MB in size, also residing in global memory. It makes sense to use T threads to perform the convolution since the calculations don't overlap, but somehow I am getting inconsistent results (sometimes it works and sometimes it does not). I will try your technique and see if it works, but I am a bit lost in loops right now.

Thanks for any light you can shed on this matter.

Cheers,
Fabio.

Hi Fabio,
I think that these are independent concepts (when to use grid-stride loops and when to use shared memory). Shared memory won't help speed up every computation in a loop -- just those that can benefit from reuse among threads of the same block. It's a part of the memory hierarchy.
Mark
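To illustrate the reuse Mark describes, here is a minimal sketch of a shared-memory 1D convolution (the kernel name, the fixed filter width, and the requirement that blockDim.x == BLOCK are assumptions made for the sketch):

#define BLOCK 256                 // must match blockDim.x at launch
#define FILTER_WIDTH 7            // assumed odd filter width
#define RADIUS (FILTER_WIDTH / 2)

__global__
void conv1d(const float *in, const float *filt, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int i     = blockIdx.x * blockDim.x + threadIdx.x;
    int local = threadIdx.x + RADIUS;

    // Cooperative load: each thread stages one element; the first
    // RADIUS threads also load the halo on both ends of the tile.
    tile[local] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = i - RADIUS;
        int right = i + BLOCK;
        tile[local - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[local + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (i < n) {
        // Each staged element is reused by up to FILTER_WIDTH threads
        // of this block -- that reuse is what shared memory buys.
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[local + k] * filt[k + RADIUS];
        out[i] = acc;
    }
}

Note that combining a tile like this with a grid-stride loop takes extra care: every thread of the block must reach __syncthreads(), so boundary threads can't simply exit the loop early.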

Thanks, Mark, for your prompt answer!
I verified your statement about how independent shared and global memory are. It then dawned on me that, within a warp, any operation done on a piece of global memory is actually performed in some fast memory, and that I wasn't gaining any extra time by copying it to shared memory first. In any case, what ended up causing the unexpected behavior was unused threads in the warp, which I made sure to deactivate with an if statement.

Again, many thanks.
Fabio

Any arithmetic instruction is performed in the compute cores (ALUs), reading data from registers. I suppose that yes, registers are "some fast memory".

Hi Mark,
sorry for a possibly silly question:
does it make sense to use the inverse indexing in the outer loop? I mean:
use i = threadIdx.x * gridDim.x + blockIdx.x instead of i = blockIdx.x * blockDim.x + threadIdx.x;
Is it always slower, or are there other reasons not to do so?

Thank you,
Alexey

Think about which threads are running together. Threads 0 through 31, for example, will get values of `i` that are spread apart by gridDim.x. This means that if you index an array using `i` you will lose locality across parallel threads (this is important even in sequential loops on a CPU). Specifically, you will not get coalescing and each thread is likely to require a separate memory transaction (loading a whole cache line). Performance will suffer.
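To make that concrete, a sketch contrasting the two indexings (hypothetical kernel names):

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // Consecutive threads of a warp get consecutive i, so their 32 loads
    // fall in the same cache lines and coalesce into few transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n)
{
    // Consecutive threads of a warp get i values gridDim.x apart, so each
    // load touches a different cache line -- one transaction per thread.
    int i = threadIdx.x * gridDim.x + blockIdx.x;
    if (i < n) out[i] = in[i];
}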

Thanks a lot, Mark!
All the best, Alexey

In the "grid-stride loop" example, would it be more efficient to store blockDim.x * gridDim.x in a register and use that register to increment i in the for loop?

If the compiler doesn't do that for you automatically then I would consider it a bug. Let me know if you find this is the case.

They appear to perform the same. :-)

Still a great article (I've come back to it multiple times for reference). I was wondering how one could implement this for tensors that require more than 3 dimensions. Thanks again! -JJ

I'm also interested in the answer to this question. Did you figure it out?

Great discussion! Now could you please describe the 2D case with a simple example?
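In the spirit of the article, here is a minimal sketch of the 2D case (the kernel name and shapes are assumptions, not from the post): stride each dimension by the total grid size in that dimension. For more than 3 dimensions, the usual trick is a 1D grid-stride loop over the flattened size, decoding the multi-dimensional indices from the linear one.

__global__
void scale2d(float *a, int nx, int ny, float s)
{
    // 2D grid-stride loop: launch with a 2D grid and 2D blocks;
    // each dimension strides independently by its grid size.
    for (int y = blockIdx.y * blockDim.y + threadIdx.y;
         y < ny;
         y += blockDim.y * gridDim.y)
    {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x;
             x < nx;
             x += blockDim.x * gridDim.x)
        {
            a[y * nx + x] *= s;
        }
    }
}

// For a 4D tensor of shape (d0, d1, d2, d3): run a 1D grid-stride loop
// over i in [0, d0*d1*d2*d3) and decode the indices, e.g.
//   int i3 = i % d3;  int t  = i / d3;
//   int i2 = t % d2;  t /= d2;
//   int i1 = t % d1;  int i0 = t / d1;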

Mr. Mark,
1- I feel a bit confused about this statement of yours: "Rather than assume that the thread grid is large enough to cover the entire data array". I don't understand the word "assume" here, because when you launch the grid to the GPU, you must already have the number of threads you need in mind (based on the data array), and therefore know exactly how many blocks you need in the syntax <<<m,t>>> to cover the entire array, don't you? If so, why "assume"?
2- Furthermore, to my knowledge, a thread is a unit of computation that you create, and it is different from a CUDA core (which is a physical element). For example, if you create a grid of 512 threads, even though you might have only 256 cores, when you launch the grid the GPU is still able to calculate all 512 threads (with a monolithic kernel). It's not that the 256 cores will calculate only 256 threads and ignore the others. However, this blog: "https://alexminnaar.com/201..." explains it in a way that suggests threads and cores are the same. Can you please help clarify this as well?
3- I have tested with nvprof, and it seems like the monolithic kernel is a bit faster than the grid-stride one, because for each thread the latter has to execute the two extra instructions below, which makes it a bit slower than the monolithic version:
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
Maybe the way I tested is not correct; I hope you can help explain more.
I am just starting with CUDA, so I would be very grateful if you could help me understand.
Thank you,

1. It's common to launch fewer threads than you have data items. Hence you need to iterate. You might even rely on (for example) the CUDA occupancy API to choose a block / grid size for you, which means you don't know until launch time how many threads you will launch (see the sketch after this list).
2. You are correct. CUDA threads are not the same as CUDA cores. CUDA threads are threads of execution that stay resident and use resources (registers, e.g.) on a single multiprocessor until they finish the kernel. CUDA cores are physical instruction execution units on the GPU.
3. If you don't need a loop, then you can write it without a loop. If the index calculation slows the kernel down, then the kernel isn't doing much computation or memory access. :)
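A sketch of point 1, using the occupancy API together with the article's saxpy kernel (the device pointers d_x and d_y are assumed):

int minGridSize = 0, blockSize = 0;
// Ask the runtime for a block size that maximizes occupancy for this kernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

// Launch just enough blocks to fill the device; the grid-stride loop
// inside saxpy then covers all n elements whatever the grid size is.
saxpy<<<minGridSize, blockSize>>>(n, 2.0f, d_x, d_y);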

Mr. Mark,
Thank you very much for the fast and informative answers.
May I bother you further on the first point? I have tested and confirmed that I misunderstood the point earlier.
Let's say I launch <<<1,256>>> with the monolithic kernel for an array of 1M elements, as in your tutorial.
__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
--> Then the GPU runs only 256 threads and stops, leaving the rest of the array untouched.
Therefore, if you use a monolithic kernel and launch <<<m,t>>> for an array of size N, make sure that m*t >= N.
I hope this comment will help other newbies like me to understand more.
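In code, the sizing rule above looks like this (a sketch; variable names assumed):

int N = 1 << 20;                          // 1M elements
int threads = 256;
// Round up so that m * t >= N; the if (i < n) guard in the kernel
// discards the few surplus threads in the last block.
int blocks = (N + threads - 1) / threads;
saxpy<<<blocks, threads>>>(N, 2.0f, d_x, d_y);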