Summation

I have written a summation kernel based on the guide at http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=532&page=5

However the result is weird. Can someone help me please?

Here’s my code

#include<cutil.h>
#include<cstdio>
#include<device_functions.h>
#include<sm_11_atomic_functions.h>

#define N 16
#define blocksize 16

__global__ void summation(int* in, unsigned int size,int* total);

int main(int argc, char** argv)
{
	CUT_DEVICE_INIT(argc,argv);

	int *a_h,*total_h;//host
	int *a_d,*total_d;//device

	a_h = (int*) malloc(N*sizeof(int));
	total_h = (int*) malloc(sizeof(int));
	*total_h=0;

	cudaMalloc((void**) &a_d, sizeof(int)*N);
	cudaMalloc((void**) &total_d, sizeof(int));

	for (int i=0;i<N;i++)
	{
		a_h[i]=i;
	}

	cudaMemcpy(a_d, a_h, sizeof(int)*N, cudaMemcpyHostToDevice);

	summation<<<1,N>>>(a_d,N,total_d);

	cudaMemcpy(total_h, total_d, sizeof(int), cudaMemcpyDeviceToHost);
	printf("Total Sum: %d",*total_h);

	CUT_EXIT(argc,argv);
}

__global__ void summation(int* in, unsigned int size,int* total)
{
	unsigned int tid=threadIdx.x;
	unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;

	__shared__ int x[blocksize];
	x[tid]=(i<size)?in[i]:0;
	__syncthreads();

	for(int s=blockDim.x/2;s>0;s/=2)
	{
		if(tid<s)
			x[tid]+=x[tid+s];
		__syncthreads();
	}

	if(tid==0)
		atomicAdd(total,x[tid]);
}

A quick scan shows good structure… you’re 99% of the way there.

Without trying it, I think the reason may be that you initialized the host result total_h to 0 but did NOT copy it to the device version, total_d.
So the device is doing an atomic add to an uninitialized value.

Try adding one line before the kernel launch:
cudaMemcpy(total_d, total_h, sizeof(int), cudaMemcpyHostToDevice);
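For reference, here is a minimal sketch of the fix suggested above, reusing the kernel from the original post. It zeroes the device accumulator with cudaMemset, which should be equivalent to copying a zero from the host. The helper name sum_on_device and the BLOCK constant are made up for illustration. The sketch assumes n > 0, a power-of-two block size, and a GPU that supports integer atomicAdd on global memory.

```cuda
#include <cuda_runtime.h>

#define BLOCK 16  // threads per block; assumed to be a power of two

// Same reduction as in the original post: each block sums its slice in
// shared memory, then thread 0 adds the block's partial sum to *total.
__global__ void summation(const int* in, unsigned int size, int* total)
{
    __shared__ int x[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    x[tid] = (i < size) ? in[i] : 0;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s)
            x[tid] += x[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, x[0]);
}

// Hypothetical helper: returns the sum of n host ints.
// Returns -1 if an allocation or the final copy fails (a crude sentinel).
int sum_on_device(const int* a_h, unsigned int n)
{
    int* a_d = 0;
    int* total_d = 0;
    int total_h = -1;

    if (cudaMalloc((void**)&a_d, n * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void**)&total_d, sizeof(int)) != cudaSuccess) {
        cudaFree(a_d);
        return -1;
    }

    cudaMemcpy(a_d, a_h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(total_d, 0, sizeof(int));  // the step missing from the original

    unsigned int blocks = (n + BLOCK - 1) / BLOCK;
    summation<<<blocks, BLOCK>>>(a_d, n, total_d);

    if (cudaMemcpy(&total_h, total_d, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        total_h = -1;

    cudaFree(a_d);
    cudaFree(total_d);
    return total_h;
}
```

Calling sum_on_device on the array 0..15 from the original post should give 120.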

Thanks, but now it returns 0 instead :(

I tried adding

CUT_CHECK_ERROR("Kernel execution failed");

after the kernel call and recompiled. It told me that the kernel execution had failed. I reinstalled my CUDA toolkit and SDK as well as the graphics driver, and now the final result is 1 instead of 0 or the actual result… (edit: the result is equal to the grid dimensions for some reason…)
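If the cutil macros are not available, roughly the same check can be sketched with the plain runtime API. The helper names cuda_ok and launch_and_check and the empty kernel noop are hypothetical. cudaDeviceSynchronize is assumed to exist in the installed toolkit (older toolkits used cudaThreadSynchronize instead).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the error string and returns false on failure.
static bool cuda_ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for the real one.
__global__ void noop() {}

bool launch_and_check()
{
    noop<<<1, 16>>>();

    // Errors detected at launch time, such as an invalid configuration.
    if (!cuda_ok(cudaGetLastError(), "kernel launch"))
        return false;

    // Errors raised while the kernel runs are reported at the next sync.
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

The printed error string is usually more informative than a bare "kernel execution failed" message.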

This is driving me nuts…

In your for loop, aren’t you reading x[tid+s] at the same time another thread is writing to that location? tid ranges over [0, 15], so you are writing to every location of x, while each thread may also be reading location x[tid + s]…

And is an atomicAdd instruction necessary, since only one thread is actually writing to that location (tid == 0)?

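Hmm, I doubt that has any effect, because there is an if(tid<s) just before it. Only threads with tid < s write, each to its own x[tid], and the elements they read, x[tid+s], lie in the upper half that nobody writes during that step, so in theory no two threads touch the same element.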
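The argument can be checked on the host with a small model of the loop. This is only a sketch: the function name reduction_has_conflict is made up, and it models a single block whose steps are separated by __syncthreads(), as in the kernel.

```cuda
#include <vector>

// Host-side model of the kernel's reduction loop for one block.
// Returns true if, within a single step, some element written by one
// thread is also read (as x[tid + s]) by a different thread.
bool reduction_has_conflict(unsigned int block_dim)
{
    for (unsigned int s = block_dim / 2; s > 0; s /= 2) {
        // writer[k] is the thread that writes x[k] in this step, or -1.
        std::vector<int> writer(block_dim, -1);
        for (unsigned int tid = 0; tid < s; ++tid)
            writer[tid] = (int)tid;        // x[tid] += ...

        for (unsigned int tid = 0; tid < s; ++tid) {
            unsigned int src = tid + s;    // ... x[tid + s]
            if (writer[src] != -1 && writer[src] != (int)tid)
                return true;
        }
    }
    return false;
}
```

In each step the writes go to indices below s and the cross-thread reads come from indices s to 2s-1, so the check finds no overlap.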

Atomic add is required if the sample size is bigger than one block (or the width of one block). If I remove the atomic add, it returns 0 instead :(

Can anybody check whether my GPU is acting weirdly or whether the problem is in the code itself?

I’m not at my CUDA machine, so I can’t plug the code in right now.

summation<<<1,N>>>(a_d,N,total_d);

Do you need a third configuration parameter that specifies the amount of shared memory?

From the Programming Guide:

Not the most clear definition, so I may be wrong in my interpretation.

Tried, but it still doesn’t work… =(

OK, finally got it working. Thanks.

I just changed the
x[tid]+=x[tid+s]

to
x[tid]=20

then I got 20 as the result

then I tried

x[tid]=x[tid]+x[tid+s]

and it worked!

Strange. Anyone want to comment why this is? I’m a bit curious…

As a side note, I don’t ever use the ‘+=’ operator. I find it’s a lot easier for someone to read if I just write it out completely. But that’s beside the point.

Yeah, it is sort of strange. When I copied the exact code to another project, it gave me the same error again… It seems like something is preventing the kernel from running. I can’t even use the debugger to reach breakpoints in the host code in such situations…

EDIT
Seems like I found the solution.

You were right: the kernel call should look like this instead

function<<<dimgrid,dimblock,arraysize>>>(…parameters…);

where in my case, arraysize=sizeof(int)*512 (512 is the maximum number of threads I can have in a block for my GPU)

Then, in the declaration,

instead of __shared__ int x[blocksize],

I put extern __shared__ int x[];
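Put together, the dynamically sized variant described above might look like the sketch below. The names summation_dyn and sum_on_device_dyn are hypothetical, and the sketch assumes n > 0 and a power-of-two thread count. Note that the third launch parameter only sizes extern __shared__ arrays. A statically sized __shared__ int x[blocksize] does not need it, so this change may not be the whole explanation for the earlier failure.

```cuda
#include <cuda_runtime.h>

// The shared array has no compile-time size. Its size in bytes is taken
// from the third argument of the <<<...>>> launch configuration.
__global__ void summation_dyn(const int* in, unsigned int size, int* total)
{
    extern __shared__ int x[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    x[tid] = (i < size) ? in[i] : 0;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s)
            x[tid] = x[tid] + x[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, x[0]);
}

// Hypothetical helper: sums n host ints with `threads` threads per block.
// Returns -1 if an allocation or the final copy fails (a crude sentinel).
int sum_on_device_dyn(const int* a_h, unsigned int n, unsigned int threads)
{
    int* a_d = 0;
    int* total_d = 0;
    int total_h = -1;

    if (cudaMalloc((void**)&a_d, n * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void**)&total_d, sizeof(int)) != cudaSuccess) {
        cudaFree(a_d);
        return -1;
    }

    cudaMemcpy(a_d, a_h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(total_d, 0, sizeof(int));  // start the accumulator at zero

    unsigned int blocks = (n + threads - 1) / threads;
    size_t shared_bytes = threads * sizeof(int);  // one int per thread
    summation_dyn<<<blocks, threads, shared_bytes>>>(a_d, n, total_d);

    if (cudaMemcpy(&total_h, total_d, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        total_h = -1;

    cudaFree(a_d);
    cudaFree(total_d);
    return total_h;
}
```

Sizing the shared array from the actual thread count, rather than a fixed 512, keeps the launch valid on GPUs with different per-block limits.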
