What is happening in my code? Please help me T_T

__global__ void repeat_cuda(float *ad, float *bd) {
    unsigned int i, j;
    // Global thread index across all blocks.
    unsigned int gpuid = blockIdx.x * blockDim.x + threadIdx.x;
    float sha[1];

    __syncthreads();
    for (j = 0; j < 10000; j++) {
        sha[0] = 0;
        for (i = 0; i < 1000; i++) {
            sha[0] = gpuid;                    // (1)
            sha[0] = sha[0] + gpuid / 1000.0;
        }
    }
    __syncthreads();

    bd[gpuid] = sha[0];
    // (2)
}

This is a simple speed test (block size = 100, grid size = 10).
Without line (1) it takes 1.6 sec,
but if I insert line (1) it takes 0.00000 sec,
and if I move line (1) down to position (2) it takes 1.6 sec again.

Does anyone know what is wrong with this code?

The result is correct, but I don't know the exact reason. The following is only my guess.

I assume that variables defined inside the kernel are stored in registers.

If CUDA places the value you are computing in a register, the computation is much quicker. The same happens if you put the value you want to use into shared memory.

Why exactly this happens also confuses me. It may depend on the architecture of CUDA.
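
One way to test this guess is to time the two variants separately with CUDA events and to check for errors after each launch. A reading of 0.00000 sec may also mean that the compiler simplified the loop (with line (1) every iteration overwrites the value first, so only the last iteration matters) or that the timer did not wait for the kernel to finish. The sketch below is only an illustration under assumptions: the kernel names repeat_accumulate and repeat_overwrite, the helper time_kernel, and the plain local float used instead of sha[1] are hypothetical and not taken from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant without line (1): each inner iteration depends on the previous one.
__global__ void repeat_accumulate(float *bd) {
    unsigned int gpuid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (unsigned int j = 0; j < 10000; j++) {
        acc = 0.0f;
        for (unsigned int i = 0; i < 1000; i++) {
            acc = acc + gpuid / 1000.0;
        }
    }
    bd[gpuid] = acc;
}

// Variant with line (1): every inner iteration overwrites acc before adding.
__global__ void repeat_overwrite(float *bd) {
    unsigned int gpuid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (unsigned int j = 0; j < 10000; j++) {
        acc = 0.0f;
        for (unsigned int i = 0; i < 1000; i++) {
            acc = gpuid;                 // corresponds to line (1)
            acc = acc + gpuid / 1000.0;
        }
    }
    bd[gpuid] = acc;
}

// Hypothetical helper: launches a kernel with grid size 10 and block size 100
// and returns the elapsed time in milliseconds, or -1 if an error occurred.
static float time_kernel(void (*kernel)(float *), float *d_out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<10, 100>>>(d_out);
    cudaEventRecord(stop);

    // Wait until the kernel and the stop event have completed.
    cudaError_t sync_err = cudaEventSynchronize(stop);
    cudaError_t launch_err = cudaGetLastError();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    if (sync_err != cudaSuccess || launch_err != cudaSuccess) {
        cudaError_t err = (sync_err != cudaSuccess) ? sync_err : launch_err;
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1.0f;
    }
    return ms;
}

int main() {
    float *d_out = nullptr;
    // 10 blocks * 100 threads = 1000 output elements.
    if (cudaMalloc(&d_out, 1000 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    printf("without (1): %f ms\n", time_kernel(repeat_accumulate, d_out));
    printf("with (1):    %f ms\n", time_kernel(repeat_overwrite, d_out));
    cudaFree(d_out);
    return 0;
}
```

If the two timings differ as much as in the original post, comparing the generated code for both kernels (for example the PTX produced by nvcc) should show whether the loop is still present in the fast variant.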

I hope this helps.