Why might processing 4 elements per thread improve performance in a simple CUDA vector add kernel?

I’m learning CUDA optimization and trying to understand the performance implications of thread granularity in a simple vector addition kernel: c[i] = a[i] + b[i].

I’ve implemented two versions:

  1. Version A: Each thread handles 1 element.
__global__ void vectorAdd_one_per_thread(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        c[idx] = a[idx] + b[idx];
    }
}
  1. Version B: Each thread handles 4 consecutive elements.
__global__ void vectorAdd_four_per_thread(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 4; ++i) {
        int idx = i * 128 + tid;
        if (idx < N) {
            c[idx] = a[idx] + b[idx];
        }
    }
}

Both versions can coalesced memory .
I’d like to confirm:
1.Why do most open-source libraries use the second version?
2.What are the main advantages of processing 4 elements per thread?

The version B code doesn’t make any sense to me. Let’s assume a non-toy problem so a reasonable sized grid, let’s pick 8192 threads.

Therefore tid will range from 0 to 8191 across the grid.

Now let’s unpack the loop.

In the first iteration (i == 0), idx will take on values for each thread the same as tid, so that means the first 8192 elements will be processed, pretty much exactly the same as what version A would do, for the same size grid.

On the next loop iteration, (i == 1), the idx will take on values from 128 to 8191+128, across the grid. This will result in considerable overlap with the processing performed by loop iteration 0. I can’t imagine the utility of that (some elements will be processed twice, although the results should still be correct), and I doubt that that is prevalent in open-source libraries.

Normally if I saw “4 elements per thread” I would expect something like:

__global__ void vectorAdd_four_per_thread(const float* a, const float* b, float* c, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N/4){
      const float4 *a4 = reinterpret_cast<const float4 *> (a);
      const float4 *b4 = reinterpret_cast<const float4 *> (b);
      float4 *c4 = reinterpret_cast<float4 *> (c);
      const float4 ai = a4[tid];
      const float4 bi = b4[tid];
      float4 ci;
      ci.x = ai.x+bi.x;
      ci.y = ai.y+bi.y;
      ci.z = ai.z+bi.z;
      ci.w = ai.w+bi.w;
      c4[tid] = ci;
    }
}

There are various requirements and considerations for the above code, its not intended to be “ready to go for any possible use”. It’s just a sketch. Among other considerations, the starting point of the vectors would have to be naturally aligned for the float4 cast. In addition, there are granularity considerations (e.g. N should be whole-number divisible by 4) and dataset coverage considerations, that vary from what the previous versions might work with.