Summation

Reivax91 · November 17, 2008, 7:55am

I have written a summation kernel based on the guide at http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532&page=5

However, the result is weird. Can someone help me, please?

Here’s my code:

#include<cutil.h>
#include<cstdio>
#include<device_functions.h>
#include<sm_11_atomic_functions.h>

#define N 16
#define blocksize 16

__global__ void summation(int* in, unsigned int size,int* total);

int main(int argc, char** argv)
{
	CUT_DEVICE_INIT(argc,argv);
	int *a_h,*total_h;//host
	int *a_d,*total_d;//device

	a_h = (int*) malloc(N*sizeof(int));
	total_h = (int*) malloc(sizeof(int));
	*total_h=0;
	cudaMalloc((void**) &a_d, sizeof(int)*N);
	cudaMalloc((void**) &total_d, sizeof(int));

	for (int i=0;i<N;i++)
	{
		a_h[i]=i;
	}
	cudaMemcpy(a_d, a_h, sizeof(int)*N, cudaMemcpyHostToDevice);

	summation<<<1,N>>>(a_d,N,total_d);

	cudaMemcpy(total_h, total_d, sizeof(int), cudaMemcpyDeviceToHost);
	printf("Total Sum: %d",*total_h);
	CUT_EXIT(argc,argv);
}

__global__ void summation(int* in, unsigned int size,int* total)
{
	unsigned int tid=threadIdx.x;
	unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;
	__shared__ int x[blocksize];
	x[tid]=(i<size)?in[i]:0;
	__syncthreads();
	for(int s=blockDim.x/2;s>0;s/=2)
	{
		if(tid<s)
			x[tid]+=x[tid+s];
		__syncthreads();
	}
	if(tid==0)
		atomicAdd(total,x[tid]);
}

SPWorley · November 17, 2008, 9:02am

A quick scan shows good structure… you’re 99% of the way there.

Without trying it, I think the reason may be that you initialized the host result total_h to 0 but did NOT copy that to the device version total_d.
So the device is doing an atomic add onto an uninitialized value.

Try adding one line before the kernel launch:
cudaMemcpy(total_d, total_h, sizeof(int), cudaMemcpyHostToDevice);
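
For reference, here is a minimal sketch of the initialization step described above. It assumes a current CUDA runtime (cuda_runtime.h rather than the old cutil.h), and the helper names alloc_zeroed_total and read_total are made up for illustration. cudaMemset is used as an alternative to copying a zero from the host; either approach should leave the accumulator at 0 before the kernel runs.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocate one int on the device and set it to zero.
// Returns nullptr if either CUDA call fails.
int* alloc_zeroed_total() {
    int* total_d = nullptr;
    if (cudaMalloc((void**)&total_d, sizeof(int)) != cudaSuccess) return nullptr;
    // cudaMemset writes byte values; an all-zero byte pattern is integer 0.
    if (cudaMemset(total_d, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(total_d);
        return nullptr;
    }
    return total_d;
}

// Hypothetical helper: copy the accumulator back to the host.
// Returns false if the copy fails.
bool read_total(const int* total_d, int* out) {
    return cudaMemcpy(out, total_d, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
}
```

Reading the value back right after allocation, before any kernel has touched it, is a cheap way to confirm the accumulator really starts at 0.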

Reivax91 · November 17, 2008, 2:23pm

Thanks, but this time it returns 0 instead :(

Reivax91 · November 18, 2008, 9:35am

I tried adding CUT_CHECK_ERROR("Kernel execution failed"); right after the kernel call and recompiled. It ended up telling me that the kernel execution had failed. I reinstalled my CUDA toolkit and SDK as well as the graphics driver, and now the final result is 1 instead of 0 or the actual result… (edit: the result is equal to the grid dimensions for some reason…)

This is driving me nuts…
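
CUT_CHECK_ERROR came from the SDK's cutil helpers. A rough equivalent using only the runtime API might look like the sketch below. It assumes a current toolkit, where cudaDeviceSynchronize replaces the older cudaThreadSynchronize. The names launch_ok and touch are hypothetical and exist only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, only here so there is something to launch.
__global__ void touch(int* out) { *out = 1; }

// Call immediately after a kernel launch.
// Returns true if both the launch and the execution succeeded;
// otherwise prints the CUDA error string and returns false.
bool launch_ok(const char* label) {
    // Launch-time problems (e.g. an invalid configuration) show up here.
    cudaError_t err = cudaGetLastError();
    // Execution-time problems only show up once the kernel has finished.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

The error string usually narrows things down much faster than a generic "kernel execution failed" message, for example by distinguishing a bad launch configuration from a fault inside the kernel.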

bdg146psu · November 18, 2008, 1:55pm

In your for loop, aren’t you reading x[tid+s] at the same time another thread is writing to that location? tid = [0, 15], so you are writing to every location of x, while each thread may also be reading location x[tid + s]…

And is an atomicAdd instruction necessary, since only one thread is actually writing to that location (tid == 0)?

Reivax91 · November 18, 2008, 4:22pm

Hmm, I doubt that will have an effect, because there is an “if(tid<s)” just before it, so in theory it prevents other threads from accessing the same locations.

Atomic add is required if the sample size is bigger than one block (or the width of one block). If I remove the atomic add, it returns 0 instead :(

Can anybody check whether my GPU is acting weirdly or whether the problem is in the code itself?
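
To make the two points above concrete, here is a hedged sketch of a multi-block version. Within one loop step, only threads with tid < s write, and they write to x[0..s-1], while the values they read sit in x[s..2s-1], so reads and writes in the same step do not overlap. Across blocks, each block's thread 0 adds its partial sum to the same global int, which is where atomicAdd is needed. The names block_sum, sum_on_device and BLOCK are illustrative, and the runtime API is assumed instead of cutil. One possible cause of the failure that the thread does not confirm: on hardware of that era, integer atomicAdd on global memory needed compute capability 1.1 or higher and a matching nvcc flag such as -arch=sm_11.

```cuda
#include <cuda_runtime.h>

#define BLOCK 16  // threads per block; must be a power of two for this loop

// Each block reduces up to BLOCK inputs in shared memory, then thread 0
// adds the block's partial sum to *total.
__global__ void block_sum(const int* in, unsigned int size, int* total) {
    __shared__ int x[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[tid] = (i < size) ? in[i] : 0;   // pad the last block with zeros
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        // Writers touch x[0..s-1]; the values read come from x[s..2s-1].
        if (tid < s) x[tid] += x[tid + s];
        __syncthreads();               // finish this step before the next
    }
    // One atomic add per block, so concurrent blocks do not lose updates.
    if (tid == 0) atomicAdd(total, x[0]);
}

// Hypothetical host wrapper. Assumes n > 0. Returns false on any CUDA error.
bool sum_on_device(const int* host_in, unsigned int n, int* out) {
    int* in_d = nullptr;
    int* total_d = nullptr;
    bool ok = false;
    unsigned int blocks = (n + BLOCK - 1) / BLOCK;
    if (cudaMalloc((void**)&in_d, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc((void**)&total_d, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(in_d, host_in, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(total_d, 0, sizeof(int)) == cudaSuccess) {
        block_sum<<<blocks, BLOCK>>>(in_d, n, total_d);
        // This copy waits for the kernel and also reports its errors.
        ok = cudaMemcpy(out, total_d, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(in_d);
    cudaFree(total_d);
    return ok;
}
```

With the inputs 0..15 from the original post, the expected total is 120. Using more than 16 inputs spreads the work across several blocks and exercises the atomicAdd path.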

bdg146psu · November 18, 2008, 7:06pm

I’m not at my CUDA machine, so I can’t plug the code in right now.

summation<<<1,N>>>(a_d,N,total_d);

Do you need a third configuration parameter that specifies the amount of shared memory?

From the Programming Guide:

Not the clearest definition, so I may be wrong in my interpretation.

Reivax91 · November 19, 2008, 12:36pm

Tried, but it still doesn’t work… =(

Reivax91 · November 19, 2008, 12:44pm

OK, I finally got it working. Thanks.

I just changed the
x[tid]+=x[tid+s]

to
x[tid]=20

then I got 20 as the result

then I tried

x[tid]=x[tid]+x[tid+s]

and it worked!

bdg146psu · November 19, 2008, 2:12pm

Strange. Anyone want to comment why this is? I’m a bit curious…

As a side note, I don’t ever use the ‘+=’ operator. I find it’s a lot easier for someone to read if I just write it out completely. But that’s beside the point.

Reivax91 · November 20, 2008, 7:12am

Yeah, it is sort of strange. When I copied the exact code to another project, it gave me the same error again… It seems like something is preventing the kernel from running. I can’t even use the debugger to reach breakpoints in the host code in such situations…

EDIT
Seems like I found the solution.

You were right: the kernel call should look like this instead

function<<<dimgrid,dimblock,arraysize>>>(…parameters…);

where in my case, arraysize=sizeof(int)*512 (512 is the maximum number of threads I can have in a block for my GPU)

then in the declaration,

instead of __shared__ int x[blocksize],

I put extern __shared__ int x[];
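
A minimal sketch of that dynamic shared memory variant follows, again assuming the current runtime API and with illustrative names (block_sum_dyn, sum_dyn). As far as I know, the third launch parameter is only needed for arrays declared extern __shared__; a statically sized __shared__ int x[blocksize] gets its storage without it. The byte count only has to cover the threads actually launched, so it is computed from the block size here rather than fixed at 512.

```cuda
#include <cuda_runtime.h>

// Same tree reduction, but the shared array is sized at launch time.
__global__ void block_sum_dyn(const int* in, unsigned int size, int* total) {
    extern __shared__ int x[];         // length comes from the 3rd launch parameter
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[tid] = (i < size) ? in[i] : 0;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) x[tid] += x[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, x[0]);
}

// Hypothetical host wrapper. Assumes n > 0 and that threads is a power of
// two no larger than the device's per-block limit.
// Returns false on any CUDA error.
bool sum_dyn(const int* host_in, unsigned int n, unsigned int threads, int* out) {
    int* in_d = nullptr;
    int* total_d = nullptr;
    bool ok = false;
    unsigned int blocks = (n + threads - 1) / threads;
    size_t shared_bytes = threads * sizeof(int);   // one int per thread
    if (cudaMalloc((void**)&in_d, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc((void**)&total_d, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(in_d, host_in, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(total_d, 0, sizeof(int)) == cudaSuccess) {
        block_sum_dyn<<<blocks, threads, shared_bytes>>>(in_d, n, total_d);
        // This copy waits for the kernel and also reports its errors.
        ok = cudaMemcpy(out, total_d, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(in_d);
    cudaFree(total_d);
    return ok;
}
```

The dynamic form is mainly useful when the block size is chosen at run time; with a fixed block size, the static declaration is simpler and leaves one less launch parameter to get wrong.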

EDIT
