The result is unpredictable.

Hello, everyone. I wrote some CUDA code, shown below. I don't know why my result is different every time I run it, even though I have already added __syncthreads(). Can anybody give me some suggestions? Thanks in advance.

__global__ void MultiplyCoef_all(short* dst, short const *src, int *height , int *width)//, int N , const *coeff)
{
	__shared__ int sharebuffer[512];
	int i=blockDim.x*blockIdx.x+threadIdx.x;
	int tid = threadIdx.x;
	int Thread_Threshold;
	int ThreadCount;
	if(i < (*width)*(*height)*N_constant)
		sharebuffer[tid] = src[i]*coeff_constant[tid%N_constant];
	for( int stride = N_constant; stride > 1; stride>>=1)
	{
		Thread_Threshold = blockDim.x >> 1;
		ThreadCount = (stride/2)*((2*tid)/stride)+tid;
		if( tid < Thread_Threshold)
			sharebuffer[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];
	}

	int ReturnStride = (blockDim.x/N_constant);
	if( tid <ReturnStride)
		dst[tid+ReturnStride*blockIdx.x] = (sharebuffer[tid]+offset_constant) >> shift_constant;
}


__syncthreads() only synchronizes the threads within a block. Threads in different blocks are not synchronized.

The last __syncthreads() is not needed.

It is possible that there is an out-of-bounds access. You can check by compiling with the -g -G flags and running the program under cuda-memcheck.
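Alongside cuda-memcheck, checking the error status after the launch can surface a failed launch or a faulting kernel. The following is only a minimal sketch: the kernel `touch` and the helper `launch_and_check` are hypothetical names made up for illustration, not part of the code in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: one write per thread, guarded by a bounds check.
__global__ void touch(int *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] = i;
}

// Hypothetical helper: launches the kernel and reports launch-time and run-time errors.
// Returns true when no error was reported. Assumes n > 0.
bool launch_and_check(int n)
{
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess)
        return false;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    touch<<<blocks, threads>>>(d_data, n);

    cudaError_t launchErr = cudaGetLastError();     // e.g. an invalid launch configuration
    cudaError_t syncErr = cudaDeviceSynchronize();  // errors raised while the kernel ran

    if (launchErr != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));
    if (syncErr != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_data);
    return launchErr == cudaSuccess && syncErr == cudaSuccess;
}
```

If the helper reports an error, running the same binary under cuda-memcheck should narrow it down to the offending access.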

Thanks. Do you know how to confirm that all the blocks are done?


All blocks that are launched always run to completion. If you have two kernel calls one after the other in the same stream, the second kernel will not start until the first one is done.
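To illustrate the ordering, here is a small sketch with two made-up kernels, `fill` and `add_one`, launched back to back in the default stream. The names and the helper `run_two_kernels` are assumptions for this example only. The second kernel sees the completed output of the first, and the blocking cudaMemcpy waits for both.

```cuda
#include <cuda_runtime.h>

// Hypothetical first kernel: writes the same value into every element.
__global__ void fill(int *data, int n, int value)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

// Hypothetical second kernel: increments every element written by the first.
__global__ void add_one(int *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

// Runs both kernels in the default stream and returns the last element.
// Assumes n > 0. Returns -1 if the allocation fails.
int run_two_kernels(int n, int value)
{
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess)
        return -1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fill<<<blocks, threads>>>(d_data, n, value);  // all blocks of fill finish first
    add_one<<<blocks, threads>>>(d_data, n);      // then add_one starts

    int last = -1;
    // Blocking copy: returns only after both kernels have completed.
    cudaMemcpy(&last, d_data + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return last;
}
```

No extra synchronization is needed between the two launches here, because they are issued to the same stream.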

Each architecture has a limit on how many blocks can be launched. On cc 2.0 the limit is 65535 blocks per grid dimension. On cc 3.0 and higher the x dimension allows up to 2^31-1 blocks. If you compile with -arch=sm_20, the blocks over 65535 will not be executed; sometimes the code will just crash, but not always.
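Rather than relying on remembered numbers, the limit can be read from the device at run time. A minimal sketch, where the helper name `max_grid_x` is an assumption made for this example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the maximum grid size in x for the given device,
// or -1 if the properties cannot be queried.
int max_grid_x(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;

    printf("cc %d.%d, max grid size: %d x %d x %d\n",
           prop.major, prop.minor,
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return prop.maxGridSize[0];
}
```

Comparing the block count you launch against this value is a quick way to rule the limit out as the cause.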

I checked my code and couldn't find anywhere that blocks could affect each other's results; they only transfer data from shared memory to global memory. I know that when the kernel function returns, all threads are done, so I'm not worried that the data fails to reach global memory. I have confirmed that my block count is below the limit. What could be making my result unpredictable?

A good way to avoid thread racing or interference is to create two separate arrays.

sharebuffer[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];

Instead, you can try

sharebufferNEW[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];

After each pass of the loop, copy sharebufferNEW back into sharebuffer (or swap which array you read from) if you have to run the loop again, with a __syncthreads() between the writes and the next reads. This way your threads will never interfere with each other. However, note that this costs some extra shared memory, as you are holding two arrays instead of one.
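As a rough sketch of the two-array idea, the kernel below sums groups of consecutive elements using two shared buffers that are swapped after every pass. It is not a drop-in replacement for the kernel in the question: the names `sum_groups`, `BLOCK` and `GROUP` are made up here, the indexing is simplified to adjacent pairs, and the coefficient multiply, offset and shift are left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // threads per block (assumed; must be a power of two)
#define GROUP 8    // elements summed per output (assumed; power of two dividing BLOCK)

// Each block sums every GROUP consecutive inputs into one output.
// Reads always come from one shared array and writes go to the other.
__global__ void sum_groups(const int *src, int *dst, int n)
{
    __shared__ int bufA[BLOCK];
    __shared__ int bufB[BLOCK];
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + tid;

    bufA[tid] = (i < n) ? src[i] : 0;  // pad the tail with zeros
    __syncthreads();

    int *in = bufA;
    int *out = bufB;
    int len = BLOCK;
    for (int width = GROUP; width > 1; width >>= 1) {
        len >>= 1;
        if (tid < len)
            out[tid] = in[2 * tid] + in[2 * tid + 1];
        __syncthreads();               // every thread reaches this barrier
        int *tmp = in; in = out; out = tmp;  // swap roles for the next pass
    }

    int outputsPerBlock = BLOCK / GROUP;
    int o = outputsPerBlock * blockIdx.x + tid;
    if (tid < outputsPerBlock && o * GROUP < n)
        dst[o] = in[tid];              // after the last swap, "in" holds the results
}

// Host helper: returns one sum per group of GROUP inputs.
std::vector<int> sum_groups_host(const std::vector<int> &h_src)
{
    int n = (int)h_src.size();
    int numOut = (n + GROUP - 1) / GROUP;
    std::vector<int> h_dst(numOut);
    if (n == 0)
        return h_dst;

    int *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_dst, numOut * sizeof(int));
    cudaMemcpy(d_src, h_src.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    sum_groups<<<blocks, BLOCK>>>(d_src, d_dst, n);  // must launch exactly BLOCK threads

    cudaMemcpy(h_dst.data(), d_dst, numOut * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return h_dst;
}
```

Swapping the two pointers avoids copying one array back into the other after each pass. The barrier sits outside the `if`, so all threads of the block reach it.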

The issue arises when the indices you read and write in an assignment do not match the thread id (or thread index).

A[i] = A[i] + A[i] will work.
A[i] = A[i+1] + A[i-1] will not work because you’ll have threads interfering with each other.
Anew[i] = A[i+1] + A[i-1] will work.
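The third form can be sketched as a small kernel in which every thread reads only from A and writes only to Anew. The names `neighbour_sum` and `neighbour_sum_host` are made up for this example, and copying the two boundary elements unchanged is an assumption, since the lines above do not say what should happen at the ends.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Out-of-place neighbour sum: no thread ever reads a location another thread writes.
__global__ void neighbour_sum(const int *A, int *Anew, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n)
        return;
    if (i == 0 || i == n - 1)
        Anew[i] = A[i];                 // boundary handling is an assumption
    else
        Anew[i] = A[i + 1] + A[i - 1];
}

// Host helper: copies A to the device, runs the kernel, and returns Anew.
std::vector<int> neighbour_sum_host(const std::vector<int> &h_A)
{
    int n = (int)h_A.size();
    std::vector<int> h_out(n);
    if (n == 0)
        return h_out;

    int *d_A = nullptr, *d_Anew = nullptr;
    cudaMalloc(&d_A, n * sizeof(int));
    cudaMalloc(&d_Anew, n * sizeof(int));
    cudaMemcpy(d_A, h_A.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    neighbour_sum<<<blocks, threads>>>(d_A, d_Anew, n);

    cudaMemcpy(h_out.data(), d_Anew, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_Anew);
    return h_out;
}
```

Because the neighbours at block edges belong to other blocks, __syncthreads() alone could not make the in-place version safe on a global array. The separate output array avoids the problem entirely.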

Thanks, I learned a lot from your suggestion :)