How to achieve a[1]+a[2], a[3]+a[4], a[5]+a[6] ........ using CUDA

Hi all,

I am confused about how to add each pair of adjacent elements in an array with a stride of 2.
If the array length is N, there should be N threads, right? And the i-th thread can use threadIdx.x to access a[i], right?
I don't know how to implement this.
Can anyone help me? Thank you very much.

If I understand the question correctly, you want to run a reduce function summing a[2i] and a[2i+1]? There are obviously multiple ways to do this in CUDA, or even in inherently parallel functional programming languages like Scala, but in CUDA C I would write a simple kernel that parallelizes over the output b values:

typedef double VECTOR;

__global__
void reduce(VECTOR *w, VECTOR *v, size_t n)
{
  // Global thread index and the grid-stride step
  size_t index = blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = blockDim.x * gridDim.x;

  // Each iteration sums one adjacent input pair: w[i] = v[2i] + v[2i+1].
  // A while loop (instead of do-while) guards threads whose index is already >= n.
  while(index < n)
    {
      w[index] = v[index<<1] + v[(index<<1)+1];
      index += stride;
    }
}

which I would call from the host code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

#define CUDACHECK(cmd) do { \
    cudaError_t e = cmd;    \
    if( e != cudaSuccess ) { \
    printf("Failed: Cuda error %s:%d '%s'\n", \
        __FILE__,__LINE__,cudaGetErrorString(e)); \
    exit(EXIT_FAILURE);     \
  } \
} while(0)

const size_t N = ...;
const unsigned blockSize = (1u<<7);

VECTOR *a, *b, *result;

result = (VECTOR*)malloc(N*sizeof(VECTOR));

CUDACHECK(cudaMalloc((void**)&b, N*sizeof(VECTOR)));
CUDACHECK(cudaMallocHost((void**)&a, 2*N*sizeof(VECTOR)));

// Fill the 2*N input values in a here. Pinned memory from cudaMallocHost is
// accessible to the kernel directly (zero-copy) on systems with unified virtual
// addressing; alternatively, allocate a with cudaMalloc and cudaMemcpy the
// inputs to it with cudaMemcpyHostToDevice.

reduce<<<(N + blockSize - 1)/blockSize, blockSize>>>(b, a, N);

CUDACHECK(cudaDeviceSynchronize());

CUDACHECK(cudaMemcpy((void*)result, (void*)b, N*sizeof(VECTOR), cudaMemcpyDeviceToHost));
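
// Optional sanity check on the host (a is still valid at this point):
// every output should equal the sum of one adjacent input pair
for (size_t i = 0; i < N; ++i)
  if (result[i] != a[2*i] + a[2*i+1])
    printf("Mismatch at element %zu\n", i);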

CUDACHECK(cudaFreeHost(a));
CUDACHECK(cudaFree(b));
free(result);

There is definitely room for optimization, both on the readability side (by using cudaMallocManaged() instead) and on the performance side (by using user-defined streams and asynchronous copies), but the pattern is the same: copy the data to the device, run the kernel, copy the data back.
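
For instance, a managed-memory version of the same host code (just a sketch; x and y are hypothetical names, and the reduce kernel above is unchanged) drops the explicit copies:

VECTOR *x, *y;

CUDACHECK(cudaMallocManaged((void**)&x, 2*N*sizeof(VECTOR)));  // inputs, visible to host and device
CUDACHECK(cudaMallocManaged((void**)&y, N*sizeof(VECTOR)));    // outputs, visible to host and device

for (size_t i = 0; i < 2*N; ++i)  // fill the inputs directly on the host
  x[i] = (VECTOR)i;

reduce<<<(N + blockSize - 1)/blockSize, blockSize>>>(y, x, N);
CUDACHECK(cudaDeviceSynchronize());

// y[0..N-1] can now be read directly on the host; no cudaMemcpy needed

CUDACHECK(cudaFree(x));
CUDACHECK(cudaFree(y));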

To go back to your question, the stride in the kernel (blockDim.x * gridDim.x, the grid-stride step) has nothing to do with the stride of 2 in the reduction itself.
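
To make that concrete, suppose (hypothetically) you launch only 2 blocks of 128 threads for N = 1024 outputs. Then stride is 256 and the thread with global index 0 performs:

w[0]   = v[0]    + v[1];     // first pass,  index = 0
w[256] = v[512]  + v[513];   // second pass, index = 256
w[512] = v[1024] + v[1025];  // third pass,  index = 512
w[768] = v[1536] + v[1537];  // fourth pass, index = 768

The grid stride only decides which outputs a given thread produces; the stride of 2 over the input lives entirely in the indexing v[2*index] and v[2*index+1].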

cross posting here:

gpu - How to achieve a[1]+a[2], a[3]+a[4], a[5]+a[6] ........ using CUDA - Stack Overflow

Hi alexvk,

Thank you very much!