Kernels are NOT queuing, bug in CUDA 2.0? cudaThreadSynchronize(); makes no difference?

[b]This thread is being discussed here: LINK

Sorry for the multiple posts, my bad.[/b]

Hi,

I have a strange problem… I have the following kernel calls, which I think should be queued in ~3 microseconds.

gpu_ford_phia<<<dimGrid,dimBlock>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[1]);

for(int i=0;i<12;i++)
{
    gpu_mtranT_prod<<<dimGrid5,dimBlock5>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
    gpu_T_prodmtran<<<dimGrid6,dimBlock6>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
    gpu_mT_prod<<<dimGrid4,dimBlock4>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
    gpu_ford_phic<<<dimGrid3,dimBlock3>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i);
}

And they are queued in 4e-3 seconds with CUDA and nvcc version 2.0, which is fine.

Now if I add this kernel call

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4] ); // the gpu_R kernel does reduction-like operations on matrices.

after the for loop of the previous kernels, the queuing takes 0.1 seconds??

Hence, I am losing all the speedup here, as I want to do some CPU calculations before the GPU finishes.
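To make the intended overlap concrete, here is a minimal sketch of the pattern. It is not the real code: `scale_kernel`, `cpu_side_work`, and the problem size are hypothetical placeholders, and it assumes the launch returns control to the host straight away.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real GPU work.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

// Hypothetical CPU-side work that is meant to run while the GPU is busy.
static double cpu_side_work(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += 1.0 / (i + 1.0);
    return acc;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch only queues the kernel; the host is expected to continue immediately.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // CPU calculations that should overlap with the kernel above.
    double cpu_result = cpu_side_work(n);

    // Block only at the point where the GPU results are actually needed.
    // (Newer toolkits call this cudaDeviceSynchronize().)
    cudaThreadSynchronize();
    cudaError_t err = cudaGetLastError();

    printf("cpu_result = %f, status = %s\n", cpu_result, cudaGetErrorString(err));
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

If the launch itself blocks, as it seems to in my case, the CPU work only starts after the kernel has finished and the overlap is lost.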

I then inserted cudaThreadSynchronize(); and saw that it takes the same time, 0.1 seconds. :thumbsdown:

Hence the program actually waits for the last kernel to finish before transferring control back to the CPU, even though I am not using any synchronization. Why would this happen?
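The comparison described above (time until the launches return versus time until the GPU is done) can be reproduced with a small self-contained test along these lines. This is only a sketch under assumptions: `busy_kernel`, its iteration count, and the host timer helper `seconds_since` are hypothetical stand-ins, not the kernels from this post.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel that just keeps the GPU busy for a while.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[idx] = v;
    }
}

// Host wall-clock helper: seconds elapsed since t0.
static double seconds_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch so one-time start-up cost is not counted in the timings.
    busy_kernel<<<grid, block>>>(d_data, n, 1);
    cudaThreadSynchronize();

    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 12; ++i)
        busy_kernel<<<grid, block>>>(d_data, n, 1000);
    double queued = seconds_since(t0);    // time until control is back on the CPU

    // Newer toolkits call this cudaDeviceSynchronize().
    cudaThreadSynchronize();
    double finished = seconds_since(t0);  // time until all kernels have completed

    printf("queued in   %e s\n", queued);
    printf("finished in %e s\n", finished);

    cudaFree(d_data);
    return 0;
}
```

If the launches are truly asynchronous, the first number should be much smaller than the second. When both come out about the same, the host is effectively waiting on the GPU during the launches, which is the behaviour I am seeing.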

I don't know why this is happening :wacko: , does anyone have any ideas?

Here is the pastebin link to the gpu_R kernel, in case anyone thinks it's due to something in the kernel (which I doubt)…

Thanks all