CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

One of the most common tasks in CUDA programming is to parallelize a loop using a kernel. As an example, let’s use our old friend SAXPY. Here’s the basic sequential implementation, which uses a for loop. To efficiently parallelize this, we need to launch enough threads to fully utilize the GPU. void saxpy(int n, float a,…
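For readers skimming the excerpt above, the pattern the post builds toward looks like this (a sketch of the familiar grid-stride SAXPY kernel, not the post's verbatim listing):

__global__
void saxpy(int n, float a, float *x, float *y)
{
    // Grid-stride loop: each thread processes multiple elements,
    // stepping by the total number of threads in the grid.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}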

Thank you! I use this pattern everywhere now.

Thank you for the article. I always learn a lot from them.

Mark,

When running loops of such large size, do we need to copy arrays from global to shared memory to speed calculations up? I have a B x T array in global memory totaling 1.4 GB in size. I need to take chunks of 1 x T size from it and perform a convolution with another 1 x T array taken from a lookup table 16 MB in size, also residing in global memory. It makes sense to use T threads to perform the convolution since the calculations don't overlap, but somehow I am getting inconsistent results (sometimes it works and sometimes it does not). I will try your technique and see if it works, but I am a bit lost in loops right now.

Thanks for any light you can shed on this matter.

Cheers,
Fabio.

Hi Fabio,
I think that these are independent concepts (when to use grid-stride loops and when to use shared memory). Shared memory won't help speed up every computation in a loop -- just those that can benefit from reuse among threads of the same block. It's a part of the memory hierarchy.
Mark
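To illustrate the reuse Mark describes, here is a minimal sketch of a shared-memory 1D convolution (the kernel name, the fixed filter width, and the requirement that blockDim.x == BLOCK are assumptions made for the sketch):

#define BLOCK 256                 // must match blockDim.x at launch
#define FILTER_WIDTH 7            // assumed odd filter width
#define RADIUS (FILTER_WIDTH / 2)

__global__
void conv1d(const float *in, const float *filt, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int i     = blockIdx.x * blockDim.x + threadIdx.x;
    int local = threadIdx.x + RADIUS;

    // Cooperative load: each thread stages one element; the first
    // RADIUS threads also load the halo on both ends of the tile.
    tile[local] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = i - RADIUS;
        int right = i + BLOCK;
        tile[local - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[local + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (i < n) {
        // Each staged element is reused by up to FILTER_WIDTH threads
        // of this block -- that reuse is what shared memory buys.
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[local + k] * filt[k + RADIUS];
        out[i] = acc;
    }
}

Note that combining a tile like this with a grid-stride loop takes extra care: every thread of the block must reach __syncthreads(), so boundary threads can't simply exit the loop early.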

Thanks, Mark, for your prompt answer!
I verified your statement about how independent shared and global memory are. It then dawned on me that, within a warp, any operation done on a piece of global memory is actually performed in some fast memory, and that I wasn't gaining any extra time by copying it to shared memory first. In any case, what ended up causing the unexpected behavior was unused threads in the warp, which I made sure to deactivate with an if statement.

Again, many thanks.
Fabio

Any arithmetic instruction is performed in the compute cores (ALUs), reading data from registers. I suppose that yes, registers are "some fast memory".

Hi Mark,
sorry for a possibly silly question:
does it make sense to use the inverse indexing in the outer loop? I mean:
use i = threadIdx.x * gridDim.x + blockIdx.x instead of i = blockIdx.x * blockDim.x + threadIdx.x;
Is it always slower, or are there other reasons not to do so?

Thank you,
Alexey

Think about which threads are running together. Threads 0 through 31, for example, will get values of `i` that are spread apart by gridDim.x. This means that if you index an array using `i` you will lose locality across parallel threads (this is important even in sequential loops on a CPU). Specifically, you will not get coalescing and each thread is likely to require a separate memory transaction (loading a whole cache line). Performance will suffer.
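To make that concrete, a sketch contrasting the two indexings (hypothetical kernel names):

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // Consecutive threads of a warp get consecutive i, so their 32 loads
    // fall in the same cache lines and coalesce into few transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n)
{
    // Consecutive threads of a warp get i values gridDim.x apart, so each
    // load touches a different cache line -- one transaction per thread.
    int i = threadIdx.x * gridDim.x + blockIdx.x;
    if (i < n) out[i] = in[i];
}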

Thanks a lot, Mark!
All the best, Alexey

In the "grid-stride loop" example, would it be more efficient to store blockDim.x * gridDim.x in a register and use that register to increment i in the for loop?

If the compiler doesn't do that for you automatically then I would consider it a bug. Let me know if you find this is the case.

They appear to perform the same. :-)

Still a great article (I've come back to it multiple times for reference). I was wondering how one could implement this for tensors that require more than 3 dimensions. Thanks again! -JJ

I'm also interested in the answer to this question. Did you figure it out?

Great discussion! Now could you please describe the 2D case with a simple example?
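In the spirit of the article, here is a minimal sketch of the 2D case (the kernel name and shapes are assumptions, not from the post): stride each dimension by the total grid size in that dimension. For more than 3 dimensions, the usual trick is a 1D grid-stride loop over the flattened size, decoding the multi-dimensional indices from the linear one.

__global__
void scale2d(float *a, int nx, int ny, float s)
{
    // 2D grid-stride loop: launch with a 2D grid and 2D blocks;
    // each dimension strides independently by its grid size.
    for (int y = blockIdx.y * blockDim.y + threadIdx.y;
         y < ny;
         y += blockDim.y * gridDim.y)
    {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x;
             x < nx;
             x += blockDim.x * gridDim.x)
        {
            a[y * nx + x] *= s;
        }
    }
}

// For a 4D tensor of shape (d0, d1, d2, d3): run a 1D grid-stride loop
// over i in [0, d0*d1*d2*d3) and decode the indices, e.g.
//   int i3 = i % d3;  int t  = i / d3;
//   int i2 = t % d2;  t /= d2;
//   int i1 = t % d1;  int i0 = t / d1;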

Mr. Mark,
1- I feel a bit confused about this statement of yours: "Rather than assume that the thread grid is large enough to cover the entire data array". I don't understand the word "assume" here, because when you launch the grid to the GPU, you must already have the number of threads you need in mind (based on the data array), and therefore know exactly how many blocks you need in the syntax <<<m,t>>> to cover the entire array, don't you? If so, why "assume"?
2- Furthermore, to my knowledge, a thread is a unit of computation that you create, and it is different from a CUDA core (which is a physical element). For example, if you create a grid of 512 threads, even though you might have only 256 cores, when you launch the grid the GPU is still able to calculate all 512 threads (with a monolithic kernel). It's not that the 256 cores will calculate only 256 threads and ignore the others. However, this blog: "https://alexminnaar.com/201..." explains it in a way that suggests threads and cores are the same. Can you please help clarify this as well?
3- I have tested with nvprof, and it seems like the monolithic kernel is a bit faster than the grid-stride one, because for each thread the latter has to execute the two extra instructions below, which makes it a bit slower than the monolithic version:
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
Maybe the way I tested is not correct; I hope you can help explain more.
I am just starting with CUDA, so I would be very grateful if you could help me understand.
Thank you,

1. It's common to launch fewer threads than you have data items. Hence you need to iterate. You might even rely on (for example) the CUDA occupancy API to choose a block / grid size for you, which means you don't know until launch time how many threads you will launch (see the sketch after this list).
2. You are correct. CUDA threads are not the same as CUDA cores. CUDA threads are threads of execution that stay resident and use resources (registers, e.g.) on a single multiprocessor until they finish the kernel. CUDA cores are physical instruction execution units on the GPU.
3. If you don't need a loop, then you can write it without a loop. If the index calculation slows the kernel down, then the kernel isn't doing much computation or memory access. :)
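A sketch of point 1, using the occupancy API together with the article's saxpy kernel (the device pointers d_x and d_y are assumed):

int minGridSize = 0, blockSize = 0;
// Ask the runtime for a block size that maximizes occupancy for this kernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);

// Launch just enough blocks to fill the device; the grid-stride loop
// inside saxpy then covers all n elements whatever the grid size is.
saxpy<<<minGridSize, blockSize>>>(n, 2.0f, d_x, d_y);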

Mr. Mark,
Thank you very much for the fast and informative answers.
May I bother you further on the first point? I have tested and confirmed that I misunderstood the point earlier.
Let's say I launch <<<1,256>>> with the monolithic kernel for an array of 1M elements, as in your tutorial.
__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
--> Then the GPU runs only 256 threads and stops, leaving the rest of the array untouched.
Therefore, if you use a monolithic kernel and launch <<<m,t>>> for an array of size N, make sure that m*t >= N.
I hope this comment will help other newbies like me to understand more.
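In code, the sizing rule above looks like this (a sketch; variable names assumed):

int N = 1 << 20;                          // 1M elements
int threads = 256;
// Round up so that m * t >= N; the if (i < n) guard in the kernel
// discards the few surplus threads in the last block.
int blocks = (N + threads - 1) / threads;
saxpy<<<blocks, threads>>>(N, 2.0f, d_x, d_y);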