stride memory and thread management

I’m new to CUDA and am trying to understand strided memory access and how threads run in a kernel.

  1. For stride memory, I ran the example code in this link:
    https://devblogs.nvidia.com/unified-memory-cuda-beginners/
    From the run result, Max error is 0, which means all array elements are being processed (added).
    How do all the array elements get processed? The code looks like it skips every blockDim.x * gridDim.x elements (at the for loop’s i += stride). How do the skipped elements get processed?
    __global__
    void add(int n, float *x, float *y)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }

  2. In this code, I noticed the total number of threads (numBlocks * blockSize) is equal to the number of elements in the x and y arrays. Does this mean each thread processes (adds x[i] and y[i]) once? If not, how do I make each thread process only one element?

  3. What would the code look like if I want each thread to process 2 rounds? For example, the first thread processes x[0] + y[0] on the 1st round, and x[524288] + y[524288] on the second round.

Also, if there’s any article you’d recommend reading, please let me know. Thank you!!

1 - It is not skipping any elements; try to think of it as SIMD (single instruction, multiple data) logic. Each thread in each block has its own unique identification number, calculated by:

unique_idx = blockDim.x * blockIdx.x + threadIdx.x;

with the loop advancing in steps of:

unique_idx += gridDim.x * blockDim.x;

It means each thread has a unique number and thus operates on the corresponding array position. For example, with blockDim.x = 256, the thread with threadIdx.x = 1 in block 3 works on element 256 * 3 + 1 = 769, and the thread with threadIdx.x = 5 in block 6 works on element 256 * 6 + 5 = 1541.
As a beginner myself, I only understood it after reading this article over and over, https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/, plus many related posts here on Devtalk and on StackOverflow, AND working through it on PAPER. I’m confident that if you do the same, you will get it.

2 - Yes, because of #1, each thread processes its respective element. Each thread keeps looping until its index fails the bound test. For example:

for (size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < array_len; idx += gridDim.x * blockDim.x)
{
    // do something at array[idx]
}

makes sure that any thread whose idx is smaller than the array length gets work to do, and any thread exceeding this limit does nothing; otherwise it would cause an out-of-bounds error by trying to access an element that doesn’t exist.

3 - Once you have the unique thread id (idx) calculated, you can test whether it equals some number so that thread does something specific:

for (size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len; idx += gridDim.x * blockDim.x)
{
    if (idx == 0)
    {
        // Do something with thread 0
    }
    else if (idx == 734628736487)
    {
        // Do something else with thread 734628736487 that is different from thread 0
    }
}

Just be aware that here you have thread divergence, and it has its consequences. Sometimes you can’t avoid it, but I’m warning you that it exists, and maybe you will want to read about the subject later. The idea is that, in a perfect world, we could give exactly the same work to all of the threads, and this is one of the objectives of parallel computing. But the world is not perfect.