how to add


int i = 0;
When running on the CPU, it’s easy to do i++.
On the GPU it works in EmuDebug, but in Release it doesn’t.
Is there a good method to do this?

I used atomicAdd but found the performance poor; it costs a lot of time.

Where is “i” stored? Is it a normal variable (therefore in a register), a shared memory variable, or a global memory variable?

Whether it is a normal variable or a shared memory variable, either is OK.

I just want to get the result of i++.

Are you trying to display the value of “i” from inside a GPU Kernel function using a printf? That is possible only in the emulation mode.

When actually running the code on the GPU (release), you cannot call any function that is not marked __device__ or __global__ (hence standard C and C++ library functions don’t work from inside the kernel).

The only way in this case to get the value of ‘i’ is to write it to some global memory location and read it back after the kernel execution is complete.
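A minimal sketch of that round trip, in case it helps. The names incrementKernel, runIncrement and d_result are made up for illustration and are not from the original post; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread increments a register variable and
// publishes the result through global memory.
__global__ void incrementKernel(int *d_result)
{
    int i = 0;       // ordinary local variable, normally held in a register
    i++;
    *d_result = i;   // the host can only see the value once it is in global memory
}

// Hypothetical host helper: launches the kernel and returns what it wrote.
int runIncrement()
{
    int *d_result = nullptr;
    int h_result = -1;

    cudaMalloc(&d_result, sizeof(int));
    incrementKernel<<<1, 1>>>(d_result);

    // This blocking copy is ordered after the kernel on the default stream,
    // so the value is read only after the kernel has finished.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_result);
    return h_result;
}
```

Calling runIncrement() from main() should return the value the kernel stored, without any printf inside the kernel.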

define ‘get’

Thank you for your answer. Right, “printf” only works in emulation mode, but that’s not my point.

For example, I want to count how many times idata[i] > threshold:


__global__ void kernel(float *idata, int threshold, float *odata)
{
    __shared__ int smem[0];

    if (idata[threadIdx.x] > threshold)
    {
        smem[0]++;
    }

    odata[threadIdx.x] = idata[threadIdx.x] / smem[0];
}


But in Release mode smem[0] is only ever 1, whereas in EmuDebug smem[0] accumulates.

So how can I get it to accumulate?

You are trying to have multiple threads write to the same shared memory location. This is a classic example of the concurrent write problem, and as far as I understand it, if multiple threads write to the same shared memory location, only one of them succeeds in CUDA.

Please refer to Page 9, Para 4 of the Histogram on CUDA document.…c/histogram.pdf

You may want to consider using the code sample for Reduction to help you with this problem:…html#reduction
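For what it’s worth, here is a rough sketch of how a reduction could be used to count the elements above the threshold. It is not the code from the SDK sample; the names countAboveKernel, countAbove, flags, blockCounts and BLOCK_SIZE are assumptions made for illustration, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed block size; must be a power of two for this reduction

// Each thread writes a 0/1 flag to its own shared memory slot, then the block
// sums the flags with a tree reduction. No two threads write the same slot
// in the same step.
__global__ void countAboveKernel(const float *idata, int n, int threshold, int *blockCounts)
{
    __shared__ int flags[BLOCK_SIZE];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    flags[tid] = (gid < n && idata[gid] > threshold) ? 1 : 0;
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            flags[tid] += flags[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockCounts[blockIdx.x] = flags[0];   // one partial count per block
}

// Hypothetical host helper: returns how many elements of h_data exceed threshold.
int countAbove(const float *h_data, int n, int threshold)
{
    if (n <= 0)
        return 0;

    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *d_data = nullptr;
    int *d_counts = nullptr;

    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_counts, numBlocks * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    countAboveKernel<<<numBlocks, BLOCK_SIZE>>>(d_data, n, threshold, d_counts);

    std::vector<int> h_counts(numBlocks);
    cudaMemcpy(h_counts.data(), d_counts, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_counts);

    // Add up the per-block partial counts on the host.
    int total = 0;
    for (int c : h_counts)
        total += c;
    return total;
}
```

The per-block partial counts are summed on the host here to keep the sketch short; a second kernel pass could do that on the device instead.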

I’ve tried to imitate the histogram method to do the accumulation, but it didn’t work. Maybe I made mistakes, so I’m asking for help if you have related experience.

What does the declaration __shared__ int smem[0]; achieve? Declare an array with 0 elements???


I thought it could be declared as __device__ __shared__ int num at global scope, or as __shared__ int num[1] inside the kernel function.

I just expected that it would accumulate per block, but it only works in emulation mode.

And in the histogram example, there is a for() loop to make it accumulate.

Could anybody tell me how to do this?

Well, it is not clear what you want to achieve. But I can tell you one thing: normally you have a shared memory array with at least as many elements as there are threads per block. You cannot write to one shared memory location from all threads at the same time. The outcome is undefined.

In emulation mode each thread writes to the location sequentially, but on the device that happens in parallel.
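If a single per-block counter is really what is wanted, one possible way around the race is an atomic add on the shared counter, so that the parallel updates are serialized instead of lost. This is only a sketch under assumptions: the GPU must support atomics on shared memory, the names normalizeByCount, runNormalize and count are invented here, the extra parameter n and the fallback for a zero count are my own choices, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed block size

// Counts, per block, how many elements exceed threshold, then divides every
// element of the block by that count.
__global__ void normalizeByCount(const float *idata, int threshold, float *odata, int n)
{
    __shared__ int count;   // a single counter shared by the whole block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0)
        count = 0;           // shared memory is not zero-initialized
    __syncthreads();

    if (gid < n && idata[gid] > threshold)
        atomicAdd(&count, 1);   // conflicting increments are applied one after another
    __syncthreads();            // all increments are done before anyone reads count

    if (gid < n)
        odata[gid] = (count > 0) ? idata[gid] / count : idata[gid];
}

// Hypothetical host helper: copies h_in to the device, runs the kernel,
// and copies the result into h_out.
void runNormalize(const float *h_in, float *h_out, int n, int threshold)
{
    if (n <= 0)
        return;

    float *d_in = nullptr;
    float *d_out = nullptr;
    int numBlocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    normalizeByCount<<<numBlocks, THREADS_PER_BLOCK>>>(d_in, threshold, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Because the atomics stay in shared memory, contention is limited to the threads of one block, which may cost less than having every thread hit a single global counter. Whether that is fast enough would have to be measured.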

__shared__ int smem[0] means nothing. No elements, nothing… It’s useless.

Don’t rely on device emulation. It does NOT model the underlying parallel hardware correctly.

Finally, “I” is dangerous. Very few men have understood what exactly I is.