Hello, everyone. I wrote the CUDA code below. I don’t know why my result is different every time, even though I have already added __syncthreads(). Can anybody give me suggestions? Thanks in advance.
__global__ void MultiplyCoef_all(short* dst, short const* src, int* height, int* width) //, int N, const* coeff)
{
    __shared__ int sharebuffer[512];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int tid = threadIdx.x;
    int Thread_Threshold;
    int ThreadCount;

    if (i < (*width) * (*height) * N_constant)
    {
        sharebuffer[tid] = src[i] * coeff_constant[tid % N_constant];
        __syncthreads();

        for (int stride = N_constant; stride > 1; stride >>= 1)
        {
            Thread_Threshold = blockDim.x >> 1;
            ThreadCount = (stride / 2) * ((2 * tid) / stride) + tid;
            if (tid < Thread_Threshold)
                sharebuffer[tid] = sharebuffer[ThreadCount] + sharebuffer[ThreadCount + stride / 2];
            __syncthreads();
        }

        int ReturnStride = blockDim.x / N_constant;
        if (tid < ReturnStride)
            dst[tid + ReturnStride * blockIdx.x] = (sharebuffer[tid] + offset_constant) >> shift_constant;
        __syncthreads();
    }
}
All blocks that are launched always run to completion. If you have two kernel calls one after the other in the same stream, the second kernel will not start until the first one is done.
Each architecture has a limit on how many blocks can be launched. On cc 2.0 the limit is 65535 blocks per grid dimension. On cc 3.0 and higher the x dimension allows 2^31 − 1 (over two billion). If you compile with --arch=sm_20, the blocks beyond 65535 will not be executed; sometimes the code will just crash, but not always.
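Rather than hard-coding those numbers, the host can query the actual limits of the installed GPU. A minimal sketch (standard CUDA runtime API, querying device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Fill in the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);

    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    // Maximum number of blocks per grid dimension (x, y, z).
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

Checking the launch configuration against prop.maxGridSize before calling the kernel avoids the silent truncation/crash behavior described above.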
I checked my code and I couldn’t find anywhere that the number of blocks would affect the results; I only transfer from shared memory to global memory. I also know that when the kernel function returns, all threads are done, so I’m not worried about the data failing to reach global memory. I confirmed that my block count is below the limit. Where could my result become unpredictable??
After you finish the loop, you can swap the buffers (sharebuffer = sharebufferNEW) if you have to run the loop again. This way your threads will never interfere with each other. However, note that this costs you some memory, since you will be holding two arrays instead of one.
The issue comes from assigning indices in your expressions that do not match the thread id (thread index):
A[i] = A[i] + A[i] will work.
A[i] = A[i+1] + A[i-1] will not work, because threads will interfere with each other.
Anew[i] = A[i+1] + A[i-1] will work.
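The double-buffer idea above can be sketched as a kernel (a minimal illustration, not your original reduction; the names A, Anew and smooth_step are made up):

```cuda
// Each thread READS only from A and WRITES only its own element of Anew,
// so no thread can ever observe a half-updated neighbour.
__global__ void smooth_step(const float* A, float* Anew, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        Anew[i] = A[i + 1] + A[i - 1];
}
```

On the host you then swap the two pointers between kernel launches instead of copying data back, so the extra cost is only the second buffer, not an extra memcpy.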