stride memory and thread management

I’m new to CUDA and am trying to understand strided memory access and how threads run in a kernel.

  1. For stride memory, I ran the example code in this link:
    https://devblogs.nvidia.com/unified-memory-cuda-beginners/
    From the run result, Max error is 0, which means all array elements are being processed (added).
    How do all the array elements get processed? The code looks like it skips every blockDim.x * gridDim.x elements (at the for loop’s i += stride). How do the skipped elements get processed?
    __global__
    void add(int n, float *x, float *y)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }

  2. In this code, I noticed the total number of threads (numBlocks * blockSize) is equal to the number of elements in the x and y arrays. Does this mean each thread processes (adds x[i] and y[i]) once? If not, how do I make each thread process only one element?

  3. What would the code look like if I want each thread to process 2 rounds? For example, the first thread processes x[0] + y[0] on the 1st round, and x[524288] + y[524288] on the second round.

Also, if there’s any article you’d recommend reading, please let me know. Thank you!!

1 - It is not skipping any elements; try to think of it as SIMD (single instruction, multiple data) logic. Each thread in each block has its own unique identification number, calculated by:

unique_idx = blockDim.x * blockIdx.x + threadIdx.x;

with the loop advancing in steps of:

unique_idx += gridDim.x * blockDim.x;

It means each thread has a unique number and thus operates on the corresponding array position. For example, with blockDim.x = 256, the thread with threadIdx.x = 1 in block 3 works on element 256 * 3 + 1 = 769, and the thread with threadIdx.x = 5 in block 6 works on element 256 * 6 + 5 = 1541.
As a beginner myself, I only understood it after reading this article over and over, https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, plus many related posts here on Devtalk and on StackOverflow, AND working through it on PAPER. I’m confident that if you do the same, you will get it.

2 - Yes, because of #1, each thread processes its respective element. Each thread keeps looping until its index fails the bound test. For example:

for (size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < array_len; idx += gridDim.x * blockDim.x)
{
    // do something at array[idx]
}

makes sure that any thread whose idx is smaller than the array length gets work to do, and any thread exceeding this limit does nothing; otherwise it would cause an out-of-bounds error by trying to access an element that doesn’t exist.

3 - Once you have the unique thread id (idx) calculated, you can test whether it equals some number so that thread does something specific:

for (size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len; idx += gridDim.x * blockDim.x)
{
    if (idx == 0)
    {
        // Do something with thread 0
    }
    else if (idx == 734628736487)
    {
        // Do something else with thread 734628736487 that is different from thread 0
    }
}

Just be aware that here you have thread divergence, and it has its consequences. Sometimes you can’t avoid it, but I’m warning you that it exists, and maybe you will want to read about the subject later. The idea is that, in a perfect world, we could give exactly the same work to all of the threads, and this is one of the objectives of parallel computing. But the world is not perfect.