CUDA - calculation of a sum

Hi ladies and gents!

I have a problem with calculating a sum in CUDA. I have a struct which contains an array of 784 elements, and I need to calculate its sum.

I tried to modify the code from the NVIDIA reduction example, but I still get the wrong sum. Maybe I used the wrong block and thread dimensions.

Please, can anybody post a small kernel function which computes the sum of this array? Please include the block and thread dimensions (blockIdx and threadIdx) used for the launch.

Any other pointers on how to do this would also help.

Thanks in advance.

PS: sorry for my English, it’s not my native language.

I tried to figure it out on a simple example which calculates the sum of 16 elements, but I get a wrong result again.

Here is the code:

[codebox]

#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>
#include <cuda.h>

__global__ void
reduce0(float* g_idata, float* g_odata, unsigned int n)
{
	extern __shared__ float temp[];
	int thid = threadIdx.x;
	int pout = 0, pin = 1;
	temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
	__syncthreads();
	for(int offset = 1; offset < n; offset *= 2)
	{
		pout = 1 - pout;
		pin = 1 - pout;
		if(thid >= offset)
			temp[pout*n+thid] += temp[pin*n+thid - offset];
		else
			temp[pout*n+thid] = temp[pin*n+thid];
		__syncthreads();
	}

	g_odata[thid] = temp[pout*n + thid];
}

int main(int argc, char **argv)
{
	float *data;
	float *odata;
	data = (float *)malloc((16)*sizeof(float));
	odata = (float *)malloc((16)*sizeof(float));
	for(int i = 0; i < 16; i++) { data[i] = (float)i/100; printf("data[%d]=%f\n", i, data[i]); }

	float *g_idata;
	float *g_odata;
	cutilSafeCall( cudaMalloc( (void**) &g_idata, (16)*sizeof(float)));
	cutilSafeCall( cudaMalloc( (void**) &g_odata, (16)*sizeof(float)));
	cutilSafeCall( cudaMemcpy( g_idata, data, 16*sizeof(float), cudaMemcpyHostToDevice) );

	reduce0<<<2, 8>>>(g_idata, g_odata, 16);

	cudaMemcpy(odata, g_odata, 16*sizeof(float), cudaMemcpyDeviceToHost);
	for(int i = 0; i < 16; i++) { printf("data[%d]=%f\n", i, odata[i]); }

	system("PAUSE");
}

[/codebox]

Hi !!!

Wow, I am having the same problem as well. Can someone please help us? It works well for me in emulation mode, but not on the GPU. I have been banging my head on the table over this for a few hours now and I don’t know how to fix it. I posted my problem and code earlier.

Thanks,

I haven’t looked closely at your code, but right off the bat, you probably need a cudaThreadSynchronize() to wait for the queued kernels to finish before you copy back the results.

You don’t need cudaThreadSynchronize to copy back the results: a plain cudaMemcpy already waits for the preceding kernel to finish.
What you do need, however, is to reserve memory for your extern shared variable. The kernel call goes like this: <<<grid,block,shmem>>>, where shmem is the amount of shared memory (in bytes) reserved for your extern shared variables (in your case: temp).
Because the value is omitted, 0 is assumed, and temp most likely overwrites other variables (and other variables overwrite temp).
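To make the third launch parameter concrete, here is a minimal sketch. The kernel `rotate_left` and the helper `run_rotate_left` are made up for this illustration and are not taken from the SDK; the only point is that the size of the `extern __shared__` array is fixed by the byte count passed in the launch, and that the copy back works without an explicit synchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo kernel: every thread stages one value in shared memory,
// then reads the value staged by its right-hand neighbour (wrapping around).
__global__ void rotate_left(const float* in, float* out)
{
    extern __shared__ float temp[];   // sized by the third launch parameter
    int tid = threadIdx.x;
    temp[tid] = in[tid];
    __syncthreads();                  // all values are staged before any read
    out[tid] = temp[(tid + 1) % blockDim.x];
}

// Hypothetical host helper: runs rotate_left on n values in a single block.
// Returns false if a CUDA call reports an error.
bool run_rotate_left(const float* h_in, float* h_out, int n)
{
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    rotate_left<<<1, n, bytes>>>(d_in, d_out);

    // cudaMemcpy waits for the kernel above, so no explicit synchronize here.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    float in[16], out[16];
    for (int i = 0; i < 16; i++) in[i] = (float)i;
    if (!run_rotate_left(in, out, 16)) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < 16; i++) printf("out[%d]=%f\n", i, out[i]);
    return 0;
}
```

If the `bytes` argument is dropped from the launch, `temp` has zero bytes reserved and every access to it is out of bounds, which is the situation in the code posted above.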

I’d recommend looking at the original white paper
http://developer.download.nvidia.com/compu…an/doc/scan.pdf

Another thing that stood out is that you’re launching 2 blocks of block size 8. The paper states:
“CUDA C code for the naive scan algorithm. This version
can handle arrays only as large as can be processed by a single
thread block running on one multiprocessor of a GPU.”

So you’d have to change the way you call it as well.
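As a rough sketch of what that change involves: in the single-block version, both the thread count and the shared memory size follow from the element count, so it is worth checking them against the device limits before launching. The helper name `fits_in_one_block` is hypothetical, and the check assumes one thread and one float of shared memory per element on device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if n elements can be handled by a single
// block that uses one thread and one float of shared memory per element.
// Assumes device 0; returns false if the device cannot be queried.
bool fits_in_one_block(unsigned int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    size_t shmem_bytes = (size_t)n * sizeof(float);
    return n > 0
        && n <= (unsigned int)prop.maxThreadsPerBlock
        && shmem_bytes <= prop.sharedMemPerBlock;
}

int main()
{
    // 16 is the test case from this thread, 784 the size from the first post.
    unsigned int sizes[3] = {16u, 784u, 1u << 20};
    for (int i = 0; i < 3; i++)
        printf("n=%u fits in one block: %s\n",
               sizes[i], fits_in_one_block(sizes[i]) ? "yes" : "no");
    return 0;
}
```

Whether 784 fits depends on the device's maximum threads per block. When it does not, each thread has to process more than one element.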

Thank you for your reply. I don’t know if I am doing it right…

I called the kernel with <<<30,240,16*sizeof(float)>>> but the result I got was #QNAN0

Am I doing something wrong?

Feel free to edit my code, if you need to.

You should really go through the programming guide again, if you haven’t already.

The parameters when calling a kernel are: <<< num_blocks_in_grid, num_threads_in_block, shared_mem_size >>>

So given that the example code says it’s only supposed to work with 1 block with as many threads as there are elements, you have to specify 1 block with, in your example, 16 threads in that block. As Cygnus explained, you also need the shared memory size, which in this case would be 16*sizeof(float).

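I’ve cut your kernel down a bit to make the point more obvious:

[codebox]
__global__ void
reduce0(float* g_idata, float* g_odata, unsigned int n) {
	extern __shared__ float temp[];
	int thid = threadIdx.x;

	temp[thid] = g_idata[thid];
	__syncthreads();

	for(int offset = 1; offset < n; offset *= 2) {
		// Read the partner value first and only then write, so that no thread
		// overwrites an element another thread still has to read.
		float addend = (thid >= offset) ? temp[thid - offset] : 0.0f;
		__syncthreads();
		temp[thid] += addend;
		__syncthreads();
	}

	// Inclusive prefix sum: g_odata[n - 1] holds the sum of all n elements.
	g_odata[thid] = temp[thid];
}
[/codebox]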

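Call it with: reduce0<<<1, 16, 16*sizeof(float)>>>(g_idata, g_odata, 16);

The kernel computes a running (prefix) sum, so the total of all 16 elements ends up in the last output element, g_odata[15].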
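For the 784-element sum from the first post, a full scan is more than is needed when only the total is wanted. The following is a sketch of a single-block tree reduction, not the SDK reduction sample itself. The names `sum_one_block`, `sum_on_gpu` and `BLOCK_SIZE` are made up here, the block size of 256 is an assumption (it has to be a power of two), and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed block size; must be a power of two

// Sums n floats using a single block. Each thread first adds up a strided
// slice of the input, then the per-thread partial sums are combined by a
// tree reduction in shared memory.
__global__ void sum_one_block(const float* in, float* out, unsigned int n)
{
    extern __shared__ float partial[];
    unsigned int tid = threadIdx.x;

    float acc = 0.0f;
    for (unsigned int i = tid; i < n; i += blockDim.x)
        acc += in[i];
    partial[tid] = acc;
    __syncthreads();

    // Halve the number of active threads on every step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        *out = partial[0];
}

// Hypothetical host wrapper: copies the input to the device, launches one
// block and returns the sum. CUDA error checks are omitted in this sketch.
float sum_on_gpu(const float* h_in, unsigned int n)
{
    float *d_in = NULL, *d_out = NULL;
    float result = 0.0f;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 1 block, BLOCK_SIZE threads, one float of shared memory per thread.
    sum_one_block<<<1, BLOCK_SIZE, BLOCK_SIZE * sizeof(float)>>>(d_in, d_out, n);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    const unsigned int n = 784;
    float data[784];
    for (unsigned int i = 0; i < n; i++) data[i] = 1.0f;
    printf("sum=%f\n", sum_on_gpu(data, n));
    return 0;
}
```

Because each thread loops over the input in steps of blockDim.x, the element count does not have to match the thread count, so the same launch works for 16 or 784 elements.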

xmonraz, thank you very much. Your code worked perfectly.
This helped me a lot. :">

Now I am going to read the programming guide again to find more answers O:)
Thanks all of you for your patience.