CUDA - calculation of a sum

naryan · April 29, 2010, 6:49pm

Hi ladies and gents!

I have a problem with calculating a sum in CUDA. I have a struct that contains an array of 784 elements, and I need to calculate its sum.

I tried to modify the code from the NVIDIA reduction example, but I still get the wrong sum. Maybe I used the wrong block and thread dimensions.

Please, can anybody post a small kernel function that computes the sum of this array? Please include the blockIdx and threadIdx dimensions.

Any other help on how to do this would also be appreciated.

Thanks in advance.

PS: sorry for my English, it’s not my native language.

I tried to figure it out on a simple example that calculates the sum of 16 elements, but I again get the wrong result.

Here is the code:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>
#include <cuda.h>

__global__ void
reduce0(float* g_idata, float* g_odata, unsigned int n)
{
	extern __shared__ float temp[];
	int thid = threadIdx.x;
	int pout = 0, pin = 1;

	temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
	__syncthreads();

	for(int offset = 1; offset < n; offset *= 2)
	{
		pout = 1 - pout;
		pin = 1 - pout;
		if(thid >= offset)
			temp[pout*n+thid] += temp[pin*n+thid - offset];
		else
			temp[pout*n+thid] = temp[pin*n+thid];
		__syncthreads();
	}

	g_odata[thid] = temp[pout*n + thid];
}

int main(int argc, char **argv)
{
	float *data;
	float *odata;
	data = (float *)malloc((16)*sizeof(float));
	odata = (float *)malloc((16)*sizeof(float));
	for(int i=0;i<16;i++){data[i] = (float)i/100; printf("data[%d]=%f\n",i,data[i]);}

	float *g_idata;
	float *g_odata;
	cutilSafeCall( cudaMalloc( (void**) &g_idata, (16)*sizeof(float)));
	cutilSafeCall( cudaMalloc( (void**) &g_odata, (16)*sizeof(float)));
	cutilSafeCall( cudaMemcpy( g_idata, data, 16*sizeof(float), cudaMemcpyHostToDevice) );

	reduce0<<<2, 8>>>(g_idata,g_odata, 16);

	cudaMemcpy(odata, g_odata, 16*sizeof(float), cudaMemcpyDeviceToHost);
	for(int i=0;i<16;i++){printf("data[%d]=%f\n",i,odata[i]);}
	system("PAUSE");
}
[/codebox]

cudaguysjsu · April 29, 2010, 8:54pm

Hi !!!

Wow, I am having the same problem as well. Can someone please help us? It works well for me in emulation mode, but not on the GPU. I have been banging my head on the table over this for a few hours now and I don’t know how to fix it. I posted my problem and code earlier.

Thanks,

xmonraz · April 29, 2010, 9:40pm

I haven’t looked closely at your code, but right off the bat, you probably need a cudaThreadSynchronize() call to wait for the queued kernels to finish before you copy back the results.

Cygnus_X1 · April 29, 2010, 10:41pm

You don’t need cudaThreadSynchronize to copy back the results, because a plain cudaMemcpy already waits for the preceding kernel to finish.
What you do need, however, is to reserve memory for your extern shared variable. The kernel launch looks like <<<grid, block, shmem>>>, where shmem is the amount of shared memory (in bytes) reserved for your extern shared variables (in your case, temp).
Because the value is omitted, 0 is assumed, and temp most likely overwrites other variables (and other variables overwrite temp).
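
For the double-buffered kernel in the first post, which indexes temp[pout*n + thid] with pout equal to 0 or 1, the buffer would need room for 2*n floats rather than n. Here is a minimal sketch of that sizing, assuming a single block of n threads and n no larger than the per-block thread limit (taken to be 1024 here, which is an assumption about the target GPU). The names scan_double_buffered and exclusive_scan_on_gpu are hypothetical. The update line is also written as a sum of two reads from the input buffer instead of +=, since the output buffer still holds values from two steps earlier; treat that as my reading of what the double buffering intends.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Exclusive prefix sum over n floats using two shared-memory buffers.
// Must be launched with ONE block of n threads and 2*n*sizeof(float)
// bytes of dynamic shared memory.
__global__ void scan_double_buffered(const float* g_idata, float* g_odata,
                                     unsigned int n)
{
    extern __shared__ float temp[];   // size comes from the 3rd launch parameter
    unsigned int thid = threadIdx.x;
    unsigned int pout = 0, pin = 1;

    // Shift right by one so the result is an exclusive scan.
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    __syncthreads();

    for (unsigned int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;              // swap the roles of the two buffers
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout * n + thid] = temp[pin * n + thid]
                                  + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];
}

// Hypothetical host helper: copies h_in to the GPU, runs the scan,
// copies the result into h_out. Returns false on any CUDA error.
bool exclusive_scan_on_gpu(const float* h_in, float* h_out, unsigned int n)
{
    assert(n > 0 && n <= 1024);       // assumed per-block thread limit
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    bool ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Two buffers of n floats each, hence 2 * bytes of shared memory.
        scan_double_buffered<<<1, n, 2 * bytes>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Because the scan is exclusive, the total of the array would be the last output element plus the last input element.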

xmonraz · April 29, 2010, 10:54pm

I’d recommend looking at the original white paper
[url=“CUDA Toolkit Documentation”]http://developer.download.nvidia.com/compu...an/doc/scan.pdf[/url]

Another thing that stood out is that you’re launching 2 blocks of 8 threads each. The paper states:
“CUDA C code for the naive scan algorithm. This version
can handle arrays only as large as can be processed by a single
thread block running on one multiprocessor of a GPU.”

So you’d have to change the way you call it as well.
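
To see why two blocks of 8 threads do not cover 16 elements when a kernel indexes only by threadIdx.x, here is a small sketch. It records, for each global position, the block-local index that a thread sees. The names record_local_index and local_indices_on_gpu are hypothetical, and the 1024-thread limit in the assert is an assumption about the GPU.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread stores its block-local index at its global position.
__global__ void record_local_index(int* out, unsigned int n)
{
    unsigned int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)
        out[global_id] = (int)threadIdx.x;   // restarts at 0 in every block
}

// Hypothetical host helper: fills h_out (blocks * threads ints) with the
// threadIdx.x value seen at each global position. Returns false on error.
bool local_indices_on_gpu(int* h_out, unsigned int blocks, unsigned int threads)
{
    assert(blocks > 0 && threads > 0 && threads <= 1024);
    unsigned int n = blocks * threads;
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;
    record_local_index<<<blocks, threads>>>(d_out, n);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(h_out, d_out, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

With a <<<2, 8>>> launch, positions 8 to 15 see local indices 0 to 7 again, so a kernel that reads g_idata[threadIdx.x] would touch only the first 8 elements, once per block.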

naryan · April 29, 2010, 11:01pm

Thank you for your reply. I don’t know if I am doing it right…

I called the kernel with <<<30,240,16*sizeof(float)>>>, but the result I got was #QNAN0…

Am I doing something wrong?

Feel free to edit my code, if you need to.

xmonraz · April 30, 2010, 12:42am

You should really go through the programming guide again, if you haven’t already.

The parameters when calling a kernel are: <<< num_blocks_in_grid, num_threads_in_block, shared_mem_size >>>

So given that the example code says it’s only supposed to work with 1 block with as many threads as there are elements, you have to specify 1 block with, in your example, 16 threads in that block. As Cygnus explained, you also need the shared memory size, which in this case would be 16*sizeof(float).

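I’ve cut your kernel down a bit to make the point more obvious:

[codebox]
__global__ void
reduce0(float* g_idata, float* g_odata, unsigned int n) {
	extern __shared__ float temp[];   // sized by the third launch parameter
	int thid = threadIdx.x;

	temp[thid] = g_idata[thid];
	__syncthreads();

	for(int offset = 1; offset < n; offset *= 2) {
		// Read the partner value first, so that no thread reads a slot
		// while another thread is updating it in the same step.
		float add = (thid >= offset) ? temp[thid - offset] : 0.0f;
		__syncthreads();
		temp[thid] += add;
		__syncthreads();
	}

	// This is an inclusive prefix sum, so g_odata[n-1] holds the total.
	g_odata[thid] = temp[thid];
}
[/codebox]

Call it with: reduce0<<<1, 16, 16*sizeof(float)>>>(g_idata, g_odata, 16);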
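
For the 784-element array from the original question, one block of 16 threads is not enough. The following is a hedged sketch of a different technique from the scan in this thread: a per-block tree reduction in shared memory, with the per-block results combined by atomicAdd. The names block_sum and sum_on_gpu are hypothetical, the block size of 256 is an arbitrary power of two, and atomicAdd on float assumes a GPU of compute capability 2.0 or later.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each block sums up to blockDim.x elements in shared memory, then thread 0
// adds the block's partial sum to *total. blockDim.x must be a power of two.
__global__ void block_sum(const float* in, float* total, unsigned int n)
{
    extern __shared__ float partial[];    // blockDim.x floats
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the array contribute zero.
    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Halve the number of active threads at every step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, partial[0]);
}

// Hypothetical host helper: writes the sum of h_in[0..n) to *h_sum.
// Returns false on any CUDA error.
bool sum_on_gpu(const float* h_in, unsigned int n, float* h_sum)
{
    const unsigned int threads = 256;     // power of two, chosen arbitrarily
    assert(n > 0);
    unsigned int blocks = (n + threads - 1) / threads;   // 4 blocks for n = 784
    float *d_in = nullptr, *d_total = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_total, sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }
    bool ok = cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(d_total, 0, sizeof(float)) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_sum, d_total, sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_total);
    return ok;
}
```

A call such as sum_on_gpu(values, 784, &result) would launch 4 blocks of 256 threads. The order in which the blocks reach atomicAdd is not fixed, so the floating-point result can differ slightly from run to run for general inputs.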

naryan · April 30, 2010, 9:02am

xmonraz, thank you very much. Your code worked perfectly.
This helped me a lot.

Now I am going to read the programming guide again to find more answers.
Thanks to all of you for your patience.
