syncthreads() and += operator...

Hi everyone,

I am new to CUDA and I tried to write my first CUDA code using the += operator.

The kernel is the following:

[codebox]__global__ void SumArray(float *a, float *c, const unsigned int N)
{
	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
	unsigned int j = threadIdx.y + blockIdx.y * blockDim.y;
	unsigned int index = i * N + j;

	float cSub = 0;
	cSub += a[index];

	__syncthreads();

	c[0] = cSub;
	return;
}[/codebox]

The kernel returns cSub = 1 (all elements of a are 1). Each thread has its own cSub value, so that result may be logical, but I used the __syncthreads() command expecting it to handle this situation. I have a Quadro FX 5600 and I use CUDA 2.0.

Thanks for any advice…

Your kernel is equivalent to the following one:

__global__ void SumArray(float *a, float *c, const unsigned int N)
{
	unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
	unsigned int j = threadIdx.y + blockIdx.y * blockDim.y;
	unsigned int index = i * N + j;

	c[0] = a[index];
}

This is not a summation of the array: every thread just writes its own element of “a” to the first element of “c”, so a race condition occurs. __syncthreads() only synchronizes the threads of a block at that point in the code; it does not combine the per-thread cSub values.
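
For reference, here is a minimal sketch of a race-free sum done by a single block with shared memory (this is only an illustration, not the SDK code; the kernel name and the 256-thread / 256-element sizes are assumptions):

__global__ void SumArraySingleBlock(const float *a, float *c)
{
	// one partial sum per thread; 256 is the assumed block size
	__shared__ float sdata[256];

	unsigned int tid = threadIdx.x;
	sdata[tid] = a[tid];           // each thread loads one element
	__syncthreads();               // make all loads visible to the block

	// tree reduction in shared memory (blockDim.x must be a power of two)
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
	{
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	if (tid == 0)
		c[0] = sdata[0];           // only one thread writes the result
}

Launched as SumArraySingleBlock<<<1, 256>>>(d_a, d_c) on a 256-element array, this writes the sum to c[0] without any race.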

Thanks for the explanation. So how can I calculate the sum of all elements of an array?

I looked at the reduction example in the SDK, but it doesn’t help me because I have some complex conditions on the sum operation.

Thanks again…

What are your complex conditions?

In my algorithm there is a main for loop that calculates the result for some parts of the array. In each iteration the location and length of the parts change. For example (these are just example loops and values):

for (k = 0; k < 4; k++)
{
	total = 0;
	mStart = k * 12;
	mStep = k * 36;    // note: when k == 0 this step is 0 and the inner loop would never advance (these are just illustrative values)
	for (m = mStart; m < arraySize; m += mStep)
	{
		Boundary = m + 6;
		for (Index = m; Index < Boundary; Index++)
			total += array[Index];
	}
	totalArray[k] = total;
}

The simplest way is to use one thread block to deal with one “k”, so you need 4 thread blocks (this means only 4 SMs are used).

Each block can use 32 threads to do the reduction, the same as in the SDK reduction example:

__shared__ float partial_sum[32];

partial_sum[ threadIdx.x ] = 0.0f;

int k = blockIdx.x;
int mStart = k * 12;
int mStep  = k * 36;

for (int m = mStart + threadIdx.x; m < arraySize; m += mStep)
{
	partial_sum[ threadIdx.x ] += array[ m ];
}

// then do the reduction over 6 elements, please see the reduction example
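
To make that last step concrete, here is a minimal sketch of how the block could finish (my illustration, not the SDK code; totalArray is the output array from your loop example, and since only partial_sum[0..5] hold wanted data, a serial sum by thread 0 is the simplest way):

__syncthreads();                        // make all partial sums visible to the block

if (threadIdx.x == 0)
{
	float total = 0.0f;
	for (int t = 0; t < 6; ++t)     // only the first 6 partial sums are wanted, see drawback (3) below
		total += partial_sum[t];
	totalArray[blockIdx.x] = total; // one result per “k”
}

The kernel would then be launched with 4 blocks of 32 threads, i.e. <<<4, 32>>>.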

However this method has several drawbacks:

(1) Only 4 SMs are used, so the usable bandwidth is only 4/8 = 1/2.

(2) Each thread block has only 32 threads and each SM holds one thread block. That means each SM has only 32 active threads, which cannot hide the pipeline latency (to hide pipeline latency, one SM needs about 192 active threads).

(3) Although you update 32 elements in shared memory, only 6 of them are what you want, so the effective bandwidth is 6/32.

Combining (1) and (2) gives a factor of 1/4, so together with (3) this method reaches at most (1/4) × (6/32) ≈ 1/20 of the peak bandwidth. On a Tesla C1060 (peak memory bandwidth roughly 102 GB/s) that is about 5 GB/s, maybe only slightly better than CPU code.

Thank you so much for your quick help…