How to achieve a[1]+a[2], a[3]+a[4], a[5]+a[6] ........ using CUDA

Hi all,

I am confused about how to add each pair of adjacent elements in an array with a stride of 2.
If the array length is N, there should be N threads, right? And the i-th thread can use threadIdx.x to access a[i], right?
I don't know how to implement this.
Can anyone help me? Thank you very much.

If I understand the question correctly, you want to run a reduce function summing a[2i] and a[2i+1]? There are obviously multiple ways to do this in CUDA, or even in inherently parallel functional programming languages like Scala, but in CUDA C I would write a simple kernel that parallelizes over the output b values:

typedef double VECTOR;

__global__
void reduce(VECTOR *w, VECTOR *v, size_t n)
{
  // Global thread index and the grid-stride step
  size_t index = blockIdx.x * blockDim.x + threadIdx.x;
  const size_t stride = blockDim.x * gridDim.x;

  // Each iteration sums one adjacent input pair: w[i] = v[2i] + v[2i+1].
  // A while loop (instead of do-while) guards threads whose index is already >= n.
  while(index < n)
    {
      w[index] = v[index<<1] + v[(index<<1)+1];
      index += stride;
    }
}

which I would call from the host code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

#define CUDACHECK(cmd) do { \
    cudaError_t e = cmd;    \
    if( e != cudaSuccess ) { \
    printf("Failed: Cuda error %s:%d '%s'\n", \
        __FILE__,__LINE__,cudaGetErrorString(e)); \
    exit(EXIT_FAILURE);     \
  } \
} while(0)

const size_t N = ...;
const unsigned blockSize = (1u<<7);

VECTOR *a, *b, *result;

result = (VECTOR*)malloc(N*sizeof(VECTOR));

CUDACHECK(cudaMalloc((void**)&b, N*sizeof(VECTOR)));
CUDACHECK(cudaMallocHost((void**)&a, 2*N*sizeof(VECTOR)));

// Fill the 2*N input values in a here. Pinned memory from cudaMallocHost is
// accessible to the kernel directly (zero-copy) on systems with unified virtual
// addressing; alternatively, allocate a with cudaMalloc and cudaMemcpy the
// inputs to it with cudaMemcpyHostToDevice.

reduce<<<(N + blockSize - 1)/blockSize, blockSize>>>(b, a, N);

CUDACHECK(cudaDeviceSynchronize());

CUDACHECK(cudaMemcpy((void*)result, (void*)b, N*sizeof(VECTOR), cudaMemcpyDeviceToHost));
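
// Optional sanity check on the host (a is still valid at this point):
// every output should equal the sum of one adjacent input pair
for (size_t i = 0; i < N; ++i)
  if (result[i] != a[2*i] + a[2*i+1])
    printf("Mismatch at element %zu\n", i);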

CUDACHECK(cudaFreeHost(a));
CUDACHECK(cudaFree(b));
free(result);

There is definitely room for optimization, both on the readability side (by using cudaMallocManaged() instead) and on the performance side (by using user-defined streams and asynchronous copies), but the pattern is the same: copy the data to the device, run the kernel, copy the data back.
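
For instance, a managed-memory version of the same host code (just a sketch; x and y are hypothetical names, and the reduce kernel above is unchanged) drops the explicit copies:

VECTOR *x, *y;

CUDACHECK(cudaMallocManaged((void**)&x, 2*N*sizeof(VECTOR)));  // inputs, visible to host and device
CUDACHECK(cudaMallocManaged((void**)&y, N*sizeof(VECTOR)));    // outputs, visible to host and device

for (size_t i = 0; i < 2*N; ++i)  // fill the inputs directly on the host
  x[i] = (VECTOR)i;

reduce<<<(N + blockSize - 1)/blockSize, blockSize>>>(y, x, N);
CUDACHECK(cudaDeviceSynchronize());

// y[0..N-1] can now be read directly on the host; no cudaMemcpy needed

CUDACHECK(cudaFree(x));
CUDACHECK(cudaFree(y));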

To go back to your question, the stride in the kernel (blockDim.x * gridDim.x, the grid-stride step) has nothing to do with the stride of 2 in the reduction itself.
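
To make that concrete, suppose (hypothetically) you launch only 2 blocks of 128 threads for N = 1024 outputs. Then stride is 256 and the thread with global index 0 performs:

w[0]   = v[0]    + v[1];     // first pass,  index = 0
w[256] = v[512]  + v[513];   // second pass, index = 256
w[512] = v[1024] + v[1025];  // third pass,  index = 512
w[768] = v[1536] + v[1537];  // fourth pass, index = 768

The grid stride only decides which outputs a given thread produces; the stride of 2 over the input lives entirely in the indexing v[2*index] and v[2*index+1].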

cross posting here:

gpu - How to achieve a[1]+a[2], a[3]+a[4], a[5]+a[6] ........ using CUDA - Stack Overflow

Hi alexvk,

Thank you very much!