I’m learning CUDA optimization and trying to understand the performance implications of thread granularity in a simple vector addition kernel: c[i] = a[i] + b[i].
I’ve implemented two versions:
Version A: Each thread handles 1 element.
__global__ void vectorAdd_one_per_thread(const float* a, const float* b, float* c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
c[idx] = a[idx] + b[idx];
}
}
Version B: Each thread handles 4 consecutive elements.
__global__ void vectorAdd_four_per_thread(const float* a, const float* b, float* c, int N) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = 0; i < 4; ++i) {
int idx = i * 128 + tid;
if (idx < N) {
c[idx] = a[idx] + b[idx];
}
}
}
Both versions can coalesced memory .
I’d like to confirm:
1.Why do most open-source libraries use the second version?
2.What are the main advantages of processing 4 elements per thread?
The version B code doesn’t make any sense to me. Let’s assume a non-toy problem so a reasonable sized grid, let’s pick 8192 threads.
Therefore tid will range from 0 to 8191 across the grid.
Now let’s unpack the loop.
In the first iteration (i == 0), idx will take on values for each thread the same as tid, so that means the first 8192 elements will be processed, pretty much exactly the same as what version A would do, for the same size grid.
On the next loop iteration, (i == 1), the idx will take on values from 128 to 8191+128, across the grid. This will result in considerable overlap with the processing performed by loop iteration 0. I can’t imagine the utility of that (some elements will be processed twice, although the results should still be correct), and I doubt that that is prevalent in open-source libraries.
Normally if I saw “4 elements per thread” I would expect something like:
There are various requirements and considerations for the above code, its not intended to be “ready to go for any possible use”. It’s just a sketch. Among other considerations, the starting point of the vectors would have to be naturally aligned for the float4 cast. In addition, there are granularity considerations (e.g. N should be whole-number divisible by 4) and dataset coverage considerations, that vary from what the previous versions might work with.