The result is unpredictable.

Hello, everyone. I wrote the CUDA code below, and I don't understand why I get a different result every time, even though I have already added __syncthreads(). Can anybody give me suggestions? Thanks in advance.

__global__ void MultiplyCoef_all(short *dst, const short *src, int *height, int *width) //, int N, const *coeff)
{
    __shared__ int sharebuffer[512];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int tid = threadIdx.x;
    int Thread_Threshold;
    int ThreadCount;
    if (i < (*width) * (*height) * N_constant)
    {
        sharebuffer[tid] = src[i] * coeff_constant[tid % N_constant];
        __syncthreads();
        for (int stride = N_constant; stride > 1; stride >>= 1)
        {
            Thread_Threshold = blockDim.x >> 1;
            ThreadCount = (stride / 2) * ((2 * tid) / stride) + tid;
            if (tid < Thread_Threshold)
                sharebuffer[tid] = sharebuffer[ThreadCount] + sharebuffer[ThreadCount + stride / 2];
            __syncthreads();
        }

        int ReturnStride = blockDim.x / N_constant;
        if (tid < ReturnStride)
            dst[tid + ReturnStride * blockIdx.x] = (sharebuffer[tid] + offset_constant) >> shift_constant;
        __syncthreads();
    }
}

Hello,

__syncthreads() only synchronizes the threads within a block. Threads in different blocks are not synchronized.

The last __syncthreads() is not needed.

It is possible that there is an out-of-bounds access. You can check by adding the -g -G flags when compiling and then running the program under cuda-memcheck.

Thanks. Do you know how to confirm that all the blocks are done?

Hello,

All blocks that are launched always run to completion. If you have two kernel calls one after the other in the same stream, the second kernel will not start until the first one is done.
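As a minimal sketch of this ordering guarantee (the kernel names here are hypothetical, not from your code): two launches issued to the same stream, including the default stream, are executed in order, so the second kernel always sees the completed results of the first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two trivial kernels (hypothetical) to illustrate in-stream ordering.
__global__ void addOne(int *d)   { d[threadIdx.x] += 1; }
__global__ void timesTwo(int *d) { d[threadIdx.x] *= 2; }

int main()
{
    int h[4] = {1, 2, 3, 4};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // Both launches go to the default stream: timesTwo cannot start
    // until every block of addOne has finished, so each element
    // becomes (x + 1) * 2 with no explicit synchronization between them.
    addOne<<<1, 4>>>(d);
    timesTwo<<<1, 4>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);  // blocks until both kernels are done
    for (int k = 0; k < 4; ++k)
        printf("%d ", h[k]);  // 4 6 8 10
    cudaFree(d);
    return 0;
}
```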

Each architecture has a limit on how many blocks can be launched. On cc 2.0 the limit is 65535 per grid dimension; on cc 3.0 and higher it is in the millions for the x dimension. If you compile with --arch=sm_20, blocks beyond 65535 will not be executed; sometimes the code will just crash, but not always.

I checked my code and I couldn't find anywhere that the blocks would affect the results; they only transfer data from shared memory to global memory. I also know that when the kernel returns, all threads are done, so I'm not worried about the data failing to reach global memory. I confirmed that my block count is below the limit. So where could the unpredictable results come from?

A good way to avoid thread racing or interference is to create two separate arrays.

sharebuffer[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];

Instead, you can try

sharebufferNEW[tid] = sharebuffer[ThreadCount]+sharebuffer[ThreadCount+stride/2];

After each pass of the loop, you can copy sharebufferNEW back into sharebuffer (if you need to run the loop again), or keep two pointers and swap them, since statically declared __shared__ arrays cannot be assigned directly. This way your threads will never interfere with each other. However, note that this costs some memory, as you are holding two arrays instead of one.

The issue arises whenever the indices you read in an expression do not match the thread id (thread index) you write to: another thread may overwrite one of those elements while you are reading it.

A[i] = A[i] + A[i] will work.
A[i] = A[i+1] + A[i-1] will not work because you’ll have threads interfering with each other.
Anew[i] = A[i+1] + A[i-1] will work.
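A hedged sketch of what the double-buffered version of your kernel could look like (the kernel name is made up, and it assumes the same __constant__ symbols N_constant, coeff_constant, offset_constant, and shift_constant as your original code). Each pass reads from one shared array and writes to the other, so within a pass no thread can read a slot that another thread is overwriting:

```cuda
// Sketch only, not a drop-in replacement: same indexing scheme as the
// original kernel, but reads and writes go to different shared arrays.
__global__ void MultiplyCoef_all_db(short *dst, const short *src,
                                    int *height, int *width)
{
    __shared__ int bufA[512];
    __shared__ int bufB[512];
    int *in  = bufA;   // array read in the current pass
    int *out = bufB;   // array written in the current pass

    int i   = blockDim.x * blockIdx.x + threadIdx.x;
    int tid = threadIdx.x;

    if (i < (*width) * (*height) * N_constant)
    {
        in[tid] = src[i] * coeff_constant[tid % N_constant];
        __syncthreads();

        for (int stride = N_constant; stride > 1; stride >>= 1)
        {
            int idx = (stride / 2) * ((2 * tid) / stride) + tid;
            if (tid < (blockDim.x >> 1))
                out[tid] = in[idx] + in[idx + stride / 2];
            __syncthreads();

            // Swap the two buffers instead of copying one into the other.
            int *tmp = in; in = out; out = tmp;
        }

        // After the final swap, the partial sums are in 'in'.
        int ReturnStride = blockDim.x / N_constant;
        if (tid < ReturnStride)
            dst[tid + ReturnStride * blockIdx.x] =
                (in[tid] + offset_constant) >> shift_constant;
    }
}
```

The pointer swap costs nothing per pass, whereas copying sharebufferNEW back into sharebuffer would need an extra loop plus another __syncthreads(); the price is twice the shared memory (4 KB instead of 2 KB per block here).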

Thanks, I learned a lot from your suggestion :)