Summing array elements using a kernel: access from the whole block grid

kelt_rivera · July 16, 2010, 10:46am

Hello!

I’m trying to sum all elements of the array a_h, which has arr_size elements, using CUDA.

As a test, all array elements are 1.0, so the answer must be exactly arr_size. But the program prints only small, incorrect values that depend on the number of blocks per grid.

Why could this be? I’ve tried everything, including device variables, but it doesn’t work! Can anyone help me?

Every thread in each block is executed, and the blocks have the proper size. Maybe the result_d pointer is different for each block, or something like that?

Thanks very much!

The code is as follows:

/////////////////////////////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>

//summing kernel

__global__ void test_kernel(float *a_d, float *result_d, int array_size){

int index=blockDim.x*blockIdx.x+threadIdx.x;

if(index<array_size){
	*result_d+=a_d[index];
};

};

int main(void){

int arr_size=1000;

float *a_h,*result_h,*a_d,*result_d;

//allocation of arrays and result variables in both memories

cudaMalloc(&a_d,sizeof(float)*arr_size);
cudaMalloc(&result_d,sizeof(float));

a_h=new float[arr_size];
result_h=new float;

*result_h=(float)0.0;

for(int i=0;i<arr_size;i++){
	a_h[i]=(float)1.0;
};

//copying array and zero result variable to device memory

cudaMemcpy(a_d,a_h,sizeof(float)*arr_size,cudaMemcpyHostToDevice);
cudaMemcpy(result_d,result_h,sizeof(float),cudaMemcpyHostToDevice);

int threads_p_block=256;

int blocks_count=arr_size/threads_p_block+1;

test_kernel<<<blocks_count,threads_p_block>>>(a_d,result_d,arr_size);

cudaMemcpy(result_h,result_d,sizeof(float),cudaMemcpyDeviceToHost);

printf("result=%f\n",*result_h);

return 0;

};
///////////////////////////////////////////////////////////////////////////////////////////////////////////

marcuse · July 16, 2010, 11:26am

The problem is that the threads execute concurrently (at least some of them), so most of them reach the line:

*result_d+=a_d[index];

at the same time. They all read the same initial value of *result_d (0.0) before any other thread has changed it, and as a result all the concurrent threads do the same operation:

*result_d = 0.0 + 1.0

The solution is either to use atomic operations, which lock the variable so that only one thread can update it at a time (fatal to performance), or to compute the sum with a parallel reduction. There are plenty of examples of the reduction method on the net. Anyway, this is not the kind of problem GPUs are most comfortable with.
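To make the atomic option concrete, here is a minimal sketch of the kernel from the question with the unsafe += replaced by atomicAdd. The names sum_atomic_kernel and sum_atomic are hypothetical (they are not in the original post), error checking is left out for brevity, and it assumes a device of compute capability 2.0 or higher, which atomicAdd on float requires.

```cuda
#include <cuda_runtime.h>

// Same indexing as the kernel in the question, but the update of the single
// accumulator goes through atomicAdd, so the read-modify-write on *result_d
// cannot be interleaved between threads.
// Assumption: compute capability 2.0+ (needed for atomicAdd on float).
__global__ void sum_atomic_kernel(const float *a_d, float *result_d, int array_size) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < array_size) {
        atomicAdd(result_d, a_d[index]);
    }
}

// Hypothetical host helper: copies the array to the device, launches the
// kernel and returns the sum. No error checking; assumes arr_size > 0.
float sum_atomic(const float *a_h, int arr_size) {
    float *a_d = nullptr;
    float *result_d = nullptr;
    float result_h = 0.0f;

    cudaMalloc(&a_d, sizeof(float) * arr_size);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(a_d, a_h, sizeof(float) * arr_size, cudaMemcpyHostToDevice);
    cudaMemset(result_d, 0, sizeof(float));  // all-zero bytes represent 0.0f

    int threads_p_block = 256;
    int blocks_count = (arr_size + threads_p_block - 1) / threads_p_block;
    sum_atomic_kernel<<<blocks_count, threads_p_block>>>(a_d, result_d, arr_size);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result_h, result_d, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(a_d);
    cudaFree(result_d);
    return result_h;
}
```

Calling sum_atomic(a_h, arr_size) with the 1000-element array of ones from the question should return arr_size. All the additions are still serialized on one memory location, so this is correct but slow for large arrays.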

kelt_rivera · July 16, 2010, 6:11pm

Thanks for your answer! Yeah… that’s really bad for performance. It looks like the program becomes nearly serial in that case (one thread operating at a time).

If it’s not too much trouble, could you show a VERY SIMPLE example of the “reduction” method for a sum? I haven’t understood how one can do it… (everything is clear about the locking).

P.S. It’s good that this is just the initialization for the main algorithm, so CUDA is still very relevant for me ^_^

kelt_rivera · July 16, 2010, 6:24pm

Found this in the FAQ…

How do I compute the sum of an array of numbers on the GPU?

This is known as a parallel reduction operation. See the “reduction” sample in the CUDA SDK for more details.

Sorry for the stupid question…
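For readers looking for the VERY SIMPLE example asked for above, the following is a minimal sketch of a parallel reduction. It is not the code from the SDK “reduction” sample, which is far more optimized. Each block sums its slice of the array in shared memory by repeatedly halving the number of active threads, and the host then adds up the per-block partial sums. The names block_sum_kernel, sum_reduction and THREADS_P_BLOCK are hypothetical, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Assumption: the block size is a power of two, which the halving loop needs.
#define THREADS_P_BLOCK 256

// Each block reduces up to THREADS_P_BLOCK elements to one partial sum.
__global__ void block_sum_kernel(const float *a_d, float *partial_d, int array_size) {
    __shared__ float cache[THREADS_P_BLOCK];

    int index = blockDim.x * blockIdx.x + threadIdx.x;
    // Threads past the end of the array contribute 0 to the sum.
    cache[threadIdx.x] = (index < array_size) ? a_d[index] : 0.0f;
    __syncthreads();

    // Tree reduction: at each step the lower half of the active threads
    // adds in the value held by the upper half.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();  // reached by every thread in the block
    }

    // Thread 0 writes this block's partial sum; no two blocks share a slot.
    if (threadIdx.x == 0) {
        partial_d[blockIdx.x] = cache[0];
    }
}

// Hypothetical host helper: runs the kernel and adds the partial sums on
// the CPU. No error checking; assumes arr_size > 0.
float sum_reduction(const float *a_h, int arr_size) {
    int blocks_count = (arr_size + THREADS_P_BLOCK - 1) / THREADS_P_BLOCK;
    float *a_d = nullptr;
    float *partial_d = nullptr;
    float *partial_h = new float[blocks_count];

    cudaMalloc(&a_d, sizeof(float) * arr_size);
    cudaMalloc(&partial_d, sizeof(float) * blocks_count);
    cudaMemcpy(a_d, a_h, sizeof(float) * arr_size, cudaMemcpyHostToDevice);

    block_sum_kernel<<<blocks_count, THREADS_P_BLOCK>>>(a_d, partial_d, arr_size);

    cudaMemcpy(partial_h, partial_d, sizeof(float) * blocks_count, cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int b = 0; b < blocks_count; b++) {
        total += partial_h[b];
    }

    delete[] partial_h;
    cudaFree(a_d);
    cudaFree(partial_d);
    return total;
}
```

No thread ever writes to a location that another thread is updating at the same moment, so no atomic operations are needed. For the 1000-element array of ones from the question, the kernel produces four partial sums, which the host adds together.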
