The result is correct under cuda-gdb, but wrong when the program runs normally

template <typename T,typename ITYPE>
207 __global__ void SpMVKernel1D(ITYPE * top,ITYPE * bitsum,T * num,ITYPE lowrowbits,ITYPE lowrowmask,ITYPE highrowmask,ITYPE lowcolbits,ITYPE lowcolmask,ITYPE highcolmask,T * x,T * y)
208 {
209 ITYPE blockColSum=gridDim.x;
210 ITYPE blockRowSum=gridDim.y;
211 ITYPE blockRow=blockIdx.y;
212 ITYPE blockCol=blockIdx.x;
213 ITYPE blockRowWidth=blockDim.y;
214 ITYPE blockColWidth=blockDim.x;
215 ITYPE threadRow=threadIdx.y;
216 ITYPE threadCol=threadIdx.x;
217 T * bnum=(T *)num;
218 ITYPE * btop ;
219 ITYPE rhi=blockRow*(lowrowmask+1);
220 T * suby=(T *)(y+rhi);
221 ITYPE chi=blockCol*(lowcolmask+1);
222 T * subx=(T *)(x+chi);
223 ITYPE * bbitsum;
224 ITYPE nonzero;
225 ITYPE sum;
226 ITYPE flag=0;
228 btop=(ITYPE *)(top+(blockRow*(blockColSum+1)+blockCol));
230 sum=btop[0];
231 nonzero = btop[1]-btop[0];
232
233 bbitsum=(ITYPE *)(bitsum+(blockRow*blockColSum*blockRowWidth+blockCol*blockRowWidth));
234 __shared__ ITYPE sharedBitSum[sharedLength];
235
236 sharedBitSum[threadRow]=bbitsum[threadRow];
237 __syncthreads();
238 ITYPE linebit;
239 if (nonzero>0)
240 {
243 for(ITYPE blocki=0;blocki<threadRow;blocki++)
244 {
247 linebit=sharedBitSum[blocki];
248 for (ITYPE blockj=0;blockj<lowcolmask+1 && linebit!=0;blockj++)
249 {
250 ITYPE bit =linebit&1;
251 if (bit>0)
252 {
253 sum++;
255 }
256 linebit>>=1;
258 }
259 }
261 linebit=sharedBitSum[threadRow];
262 T tempy=0.0f;
263 for (ITYPE blockj=0;blockj<lowcolmask+1 && linebit!=0;blockj++)
264 {
265 ITYPE bit=linebit&1;
266 if (bit>0)
267 {
268 flag=1;
269 tempy+=num[sum++]*subx[blockj];
}
278 }
281 suby[threadRow]+=tempy;//write to global memory
284 }
285
289 }

The code above is the kernel.
The main idea is that different threads each compute a partial result, and after the program has finished we get the final result.

The problem is as follows:
If I use cuda-gdb and set a breakpoint at line 209, I get the correct result, but if I run the program normally, the result is wrong.

The correct result is: 4 4 6 6 6 6 6 6 4 4
The wrong result is: 2 2 2 2 2 2 2 2 2 2

I have found that the problem is in line 281: the values of tempy are not added up correctly when multiple threads run at the same time.

But I don't know how to solve it.
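
One possible direction, as far as I understand it: the `+=` in line 281 is a separate read, add and write, so blocks that share the same blockRow (and so the same suby) can overwrite each other's partial sums. A common way to avoid this is atomicAdd. Below is a minimal sketch under that assumption, not the real kernel. The names accumulatePartial and accumulateDemo, the constant partial value 1.0f and the launch shape are all hypothetical and only for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel above: every block in the same grid row
// (same blockIdx.y) contributes a partial value to the same elements of y.
__global__ void accumulatePartial(float *y, int rowsPerBlock)
{
    int row = blockIdx.y * rowsPerBlock + threadIdx.y;
    float tempy = 1.0f;          // placeholder for the per-thread partial sum
    // y[row] += tempy;          // racy: concurrent blocks can lose each other's updates
    atomicAdd(&y[row], tempy);   // read-modify-write performed as one indivisible step
}

// Hypothetical host helper: launches blocksX x blocksY blocks and returns y.
std::vector<float> accumulateDemo(int blocksX, int blocksY, int rowsPerBlock)
{
    int n = blocksY * rowsPerBlock;
    float *dY = nullptr;
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemset(dY, 0, n * sizeof(float));   // y must be zero before accumulating into it

    dim3 grid(blocksX, blocksY);
    dim3 block(1, rowsPerBlock);
    accumulatePartial<<<grid, block>>>(dY, rowsPerBlock);

    std::vector<float> hY(n);
    // A blocking copy on the default stream waits for the kernel to finish first.
    cudaMemcpy(hY.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dY);
    return hY;
}
```

With this sketch, calling accumulateDemo(4, 2, 5) from main should give 10 elements, each the sum of the 4 contributions from the blocks in its grid row. In the real kernel the corresponding change would be to replace the `+=` in line 281 with atomicAdd(&suby[threadRow], tempy). Note that atomicAdd on double needs a GPU of compute capability 6.0 or newer, so this matters if T is double. It may also explain why cuda-gdb gives the right answer: stopping at a breakpoint probably changes the timing of the blocks enough to hide the conflict.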