Kernels are NOT queuing: bug in CUDA 2.0? cudaThreadSynchronize() makes no difference?

[b]This thread is being discussed here: LINK

Sorry for the multiple posts, my bad.[/b]


I have a strange problem… I have the following kernel calls, which I think should be queued in ~3 microseconds.


	 for(int i=0;i<12;i++)







And they are queued in 4e-3 seconds with CUDA and nvcc version 2.0, which is fine.

Now if I add this kernel call

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4] ); // the GPU_R kernel does reduction sort of operations on matrices.

after the for loop of the previous kernels, the queuing takes 0.1 seconds??

Hence, I am losing all the speedup here, as I want to do some CPU calculations before the GPU finishes.
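To make the goal concrete, here is a minimal sketch of the overlap I am after. It is not my actual code: the kernel `scale_kernel`, the helper `overlap_cpu_work`, and the launch sizes are hypothetical stand-ins, and it assumes a toolkit where `cudaEventQuery` can be polled from the host while a kernel is still running.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: multiplies every element by `factor`.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel, does placeholder CPU work until the GPU reports
// completion, then copies the result back. Returns the number of CPU
// iterations that ran while the queued work was still in flight.
long overlap_cpu_work(float *h_data, int n)
{
    float *d_data = nullptr;
    cudaEvent_t done;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventCreate(&done);

    // The launch is expected to return as soon as the kernel is queued.
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    // The event completes only after all work queued before it has finished.
    cudaEventRecord(done, 0);

    long cpu_steps = 0;
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        ++cpu_steps;  // placeholder for the real CPU calculation
    }

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(done);
    cudaFree(d_data);
    return cpu_steps;
}
```

If the launch really is asynchronous, the polling loop gets a chance to run while the kernel executes; if the launch blocks, the event is already complete by the time the loop is reached.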

I then inserted cudaThreadSynchronize(); after the kernel calls and saw that it takes the same time, 0.1 seconds. :thumbsdown:
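For anyone who wants to reproduce the measurement, this is roughly how the two times can be separated. It is only a sketch under assumptions: `busy_kernel`, `LaunchTiming`, and `time_launch` are hypothetical names, the kernel just spins on `clock64()` (so it needs a GPU that supports it), and it uses `cudaDeviceSynchronize()`, which replaces `cudaThreadSynchronize()` in later CUDA releases.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: spins for roughly `cycles` clock ticks so
// that the kernel runs much longer than the launch call should take.
__global__ void busy_kernel(long long cycles, int *flag)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
    if (threadIdx.x == 0 && blockIdx.x == 0) *flag = 1;
}

struct LaunchTiming {
    double launch_s;  // host time until the launch call returned
    double total_s;   // host time until the kernel had actually finished
};

LaunchTiming time_launch(long long cycles)
{
    using clk = std::chrono::steady_clock;
    int *d_flag = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaDeviceSynchronize();  // finish setup before starting the clock

    auto t0 = clk::now();
    busy_kernel<<<1, 32>>>(cycles, d_flag);
    auto t1 = clk::now();     // launch call has returned
    cudaDeviceSynchronize();  // wait for the kernel to finish
    auto t2 = clk::now();

    cudaFree(d_flag);

    LaunchTiming t;
    t.launch_s = std::chrono::duration<double>(t1 - t0).count();
    t.total_s  = std::chrono::duration<double>(t2 - t0).count();
    return t;
}
```

With an asynchronous launch, `launch_s` should be much smaller than `total_s`. If the two are nearly equal, the launch call itself is waiting for the kernel, which is what I seem to be seeing.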

Hence the program actually waits for the last kernel to finish before returning control to the CPU, even though I am not using any synchronization. Why would this happen?

I don't know why this is happening :wacko: , does anyone have any ideas?

Here is the pastebin link to the gpu_R kernel, in case anyone thinks it is due to something in the kernel (which I doubt)…

Thanks, all.