I’m new to CUDA and am trying to understand strided memory access and how threads execute a kernel.
-
For strided access, I ran the example code from this post:
Unified Memory for CUDA Beginners | NVIDIA Technical Blog
In the run output, Max error is 0, which means every array element was processed (added).
How do all of the array elements get processed? The loop appears to jump ahead by blockDim.x * gridDim.x on every iteration (i += stride in the for loop). How do the skipped elements get processed?
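    // CUDA kernel to add the elements of two arrays
    __global__ void add(int n, float *x, float *y)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        int stride = blockDim.x * gridDim.x;                // total threads in the grid
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }
-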
In this code, I noticed that the total number of threads (numBlocks * blockSize) equals the number of elements in the x and y arrays. Does this mean each thread processes exactly one element (a single x[i] + y[i] addition)? If not, how do I make each thread process exactly one element?
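To make the two questions above concrete, here is a minimal sketch that records which loop iteration (round) handles each element. It is an illustration under stated assumptions, not code from the blog post: the kernel `trace` and the host helper `rounds_needed` are hypothetical names, and the sketch assumes a CUDA-capable GPU and compilation with nvcc. The point is that each thread jumps by `stride`, but its neighbours start at every index in between, so together the threads cover every element.

```cuda
#include <cuda_runtime.h>

// Hypothetical tracing kernel: same loop shape as add(), but instead of
// adding, it records which loop iteration (round) visited each element.
__global__ void trace(int n, int *round)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = blockDim.x * gridDim.x;                 // total threads in the grid
    int r = 0;
    for (int i = index; i < n; i += stride)
        round[i] = r++;  // 0 on this thread's first pass, 1 on its second, ...
}

// Hypothetical host helper: returns how many rounds the busiest thread ran,
// or -1 if a CUDA call failed or some element was never visited.
int rounds_needed(int n, int numBlocks, int blockSize)
{
    int *round = nullptr;
    if (cudaMallocManaged(&round, n * sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < n; i++) round[i] = -1;  // -1 marks "not visited"

    trace<<<numBlocks, blockSize>>>(n, round);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(round); return -1; }

    int maxRound = 0;
    for (int i = 0; i < n; i++) {
        if (round[i] < 0) { cudaFree(round); return -1; }  // a skipped element
        if (round[i] > maxRound) maxRound = round[i];
    }
    cudaFree(round);
    return maxRound + 1;
}
```

With the blog's configuration (n = 1 << 20, blockSize = 256, numBlocks = 4096), stride equals n, so `rounds_needed(1 << 20, 4096, 256)` should return 1: each thread handles exactly one element and the loop body runs once per thread. Launching fewer threads than elements makes each thread loop more than once, but no element is left out.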
-
What would the code look like if I wanted each thread to process two rounds? For example, the first thread would process x[0] + y[0] in the first round and x[524288] + y[524288] in the second round.
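As a sketch of the two-round case, one option is to leave the kernel unchanged and launch only half as many threads, so that stride becomes n / 2. This assumes n = 1 << 20 and blockSize = 256 as in the blog post; the helper `add_max_error` is a hypothetical name introduced here for illustration, and the sketch assumes a CUDA-capable GPU and compilation with nvcc.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Same grid-stride kernel as in the blog post.
__global__ void add(int n, float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical host helper: fills x with 1.0f and y with 2.0f, runs add()
// with the given launch configuration, and returns the largest deviation
// of y[i] from 3.0f (or -1.0f if a CUDA call failed).
float add_max_error(int n, int numBlocks, int blockSize)
{
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1.0f;
    }
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<numBlocks, blockSize>>>(n, x, y);
    float maxError = -1.0f;
    if (cudaDeviceSynchronize() == cudaSuccess) {
        maxError = 0.0f;
        for (int i = 0; i < n; i++)
            maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
    }
    cudaFree(x);
    cudaFree(y);
    return maxError;
}
```

For two rounds per thread, compute the block count from half the element count: numBlocks = (n / 2 + blockSize - 1) / blockSize, which is 2048 here. The grid then has 524288 threads, stride is 524288, and the thread with index 0 handles element 0 in its first iteration and element 524288 in its second. Calling `add_max_error(1 << 20, 2048, 256)` should still report a max error of 0.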
Also, if there are any articles you’d recommend reading, please let me know. Thank you!