how to add

luca · May 26, 2008, 3:25am

Hello,

int i=0;
When running on the CPU, it's easy to do i++;
On the GPU it works in EmuDebug, but in Release it doesn't work.
Is there any good method to do this?

I used atomicAdd but found the performance poor; it costs a lot of time.

seibert · May 26, 2008, 3:35am

Where is “i” stored? Is it a normal variable (therefore in a register), a shared memory variable, or a global memory variable?

luca · May 26, 2008, 4:42am

It doesn't matter whether it is a normal variable or a shared memory variable; either is OK.

I just wanna get the result of i++;

suhailrehman · May 26, 2008, 5:13am

Are you trying to display the value of “i” from inside a GPU Kernel function using a printf? That is possible only in the emulation mode.

When actually running the code on the GPU (release), you cannot call any function that is not declared __device__ or __global__ (hence standard C and C++ library functions don't work from inside the kernel).

The only way in this case to get the value of ‘i’ is to write it to some global memory location and read it back after the kernel execution is complete.
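
A minimal sketch of that write-then-read-back pattern, for illustration only. The names compute_i and read_back_i are hypothetical and not from the original posts, and error checking is left out to keep it short:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread increments its own private copy of i,
// and thread 0 publishes its copy to global memory.
__global__ void compute_i(int *d_result)
{
    int i = 0;              // a normal variable: lives in a register, one per thread
    i++;
    if (threadIdx.x == 0)
        *d_result = i;      // write to global memory so the host can read it
}

// Hypothetical host helper: launches the kernel and returns the value read back.
int read_back_i()
{
    int *d_result = nullptr;
    int h_result = -1;

    cudaMalloc(&d_result, sizeof(int));
    compute_i<<<1, 256>>>(d_result);

    // A blocking cudaMemcpy on the default stream runs after the kernel has finished.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

Because i is private to each thread here, every thread ends up with i == 1; this only shows how to get a value out of a kernel, not how to accumulate across threads.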

E.D_Riedijk · May 26, 2008, 5:29am

define ‘get’

luca · May 26, 2008, 5:29am

Thank you for your answer. Right, "printf" only works in emulation mode, but that's not my point.

For example, I want to count the elements where idata[i] > threshold:

kernel<<<1,256>>>();

__global__ void kernel(float *idata, int threshold, float *odata)
{
    __shared__ int smem[0];

    if (idata[threadIdx.x] > threshold)
        smem[0]++;

    __syncthreads();

    odata[threadIdx.x] = idata[threadIdx.x] / smem[0];
}

But in Release mode smem[0] only ever ends up as 1, while in EmuDebug smem[0] accumulates.

So how can I get it to accumulate?

suhailrehman · May 26, 2008, 5:54am

You are trying to have multiple threads write to the same shared memory location. This is a classic concurrent-write problem and, as far as I understand it, if multiple threads write to the same shared memory location at the same time, only one of the writes is guaranteed to succeed in CUDA.

Please refer to Page 9, Para 4 of the Histogram on CUDA document.
http://developer.download.nvidia.com/compu...c/histogram.pdf

You may want to consider using the code sample for Reduction to help you with this problem:

http://www.nvidia.com/object/cuda_sample_perf_strategies.html#reduction
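
For illustration, here is a rough sketch of how a reduction could replace the shared counter in the kernel posted above. It is not the NVIDIA sample itself; the names count_and_scale, count_above_reduction and BLOCK_SIZE are made up for this sketch, a single block of 256 threads is assumed, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed threads per block; must be a power of two here

// Each thread records a 0/1 flag in its own shared slot, then the flags are
// summed with a tree reduction so no two threads write the same location.
__global__ void count_and_scale(const float *idata, int threshold,
                                float *odata, int *count_out)
{
    __shared__ int flags[BLOCK_SIZE];
    const int tid = threadIdx.x;

    flags[tid] = (idata[tid] > threshold) ? 1 : 0;
    __syncthreads();

    // Halve the number of active threads each step; flags[0] ends up holding the sum.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            flags[tid] += flags[tid + stride];
        __syncthreads();
    }

    const int count = flags[0];
    if (tid == 0)
        *count_out = count;
    // Guard against dividing by zero when nothing exceeds the threshold.
    odata[tid] = (count > 0) ? idata[tid] / count : 0.0f;
}

// Hypothetical host helper: h_in must hold exactly BLOCK_SIZE floats.
int count_above_reduction(const float *h_in, int threshold)
{
    float *d_in = nullptr, *d_out = nullptr;
    int *d_count = nullptr;
    int h_count = -1;

    cudaMalloc(&d_in, BLOCK_SIZE * sizeof(float));
    cudaMalloc(&d_out, BLOCK_SIZE * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, BLOCK_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    count_and_scale<<<1, BLOCK_SIZE>>>(d_in, threshold, d_out, d_count);

    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_count;
}
```

The reduction takes log2(256) = 8 steps, and each step is separated by __syncthreads() so that a partial sum is never read before it has been written.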

luca · May 26, 2008, 6:26am

I've tried to imitate the histogram method to do the accumulation, but it didn't work. Maybe I made mistakes, so I'm asking for help if you have related experience.

Sarnath · May 26, 2008, 8:36am

What does the statement "__shared__ int smem[0];" achieve? Does it declare an array with 0 elements???

Kandi

luca · May 27, 2008, 3:24am

I thought it could be declared as __device__ __shared__ int num at global scope, or as __shared__ int num[1] inside the kernel function.

I just expected that it would accumulate per block, but it only works in emulation mode.

And in the histogram example, there is a for() loop to make it accumulate.

Could anybody tell me how to do this?

E.D_Riedijk · May 27, 2008, 4:00am

Well, it is not clear what you want to achieve. But I can tell you one thing: normally you have at least as many shared memory elements as there are threads per block. You cannot write to one shared memory location from all threads at the same time; the outcome is undefined.

In emulation mode each thread writes to the location sequentially, but on the device the writes happen in parallel.
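
If a single shared counter is still wanted, one possible way to make the parallel writes safe is atomicAdd on shared memory. This is only a sketch under assumptions: it needs a device that supports shared-memory atomics (compute capability 1.2 or later), the names count_with_atomic and count_above_atomic are hypothetical, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void count_with_atomic(const float *idata, int threshold,
                                  float *odata, int *count_out)
{
    __shared__ int counter;         // one counter shared by the whole block

    if (threadIdx.x == 0)
        counter = 0;                // shared memory is not zero-initialized
    __syncthreads();

    if (idata[threadIdx.x] > threshold)
        atomicAdd(&counter, 1);     // atomic read-modify-write, so no update is lost
    __syncthreads();                // all increments are complete past this point

    const int count = counter;
    if (threadIdx.x == 0)
        *count_out = count;
    // Guard against dividing by zero when nothing exceeds the threshold.
    odata[threadIdx.x] = (count > 0) ? idata[threadIdx.x] / count : 0.0f;
}

// Hypothetical host helper: n is the number of elements and also the number of
// threads in the single block, so it must not exceed the device's block-size limit.
int count_above_atomic(const float *h_in, int n, int threshold)
{
    float *d_in = nullptr, *d_out = nullptr;
    int *d_count = nullptr;
    int h_count = -1;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    count_with_atomic<<<1, n>>>(d_in, threshold, d_out, d_count);

    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_count;
}
```

The atomic updates to the same address are serialized by the hardware, which is why heavy contention on one counter can be slow; giving each thread its own shared slot and summing afterwards avoids that contention.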

Sarnath · May 27, 2008, 9:43am

__shared__ int smem[0] means nothing. No elements, nothing… It's useless.

Don't rely on device emulation. It does NOT model the underlying parallel hardware correctly.

Finally, “I” is dangerous. Very few men have understood what exactly I is.
