cudaThreadSynchronize usage

hi

When do I need to call ‘cudaThreadSynchronize’?
Right now I have a weird ‘bug’ in my program. Something like this works fine:

cudaMemcpy(SomeMemory, …);
cudaThreadSynchronize();
Kernel<<<…>>>(SomeMemory);

… and something like this does not work (‘unknown error’ inside the kernel):

cudaMemcpy(SomeMemory, …);
Kernel<<<…>>>(SomeMemory);

Do I need to call cudaThreadSynchronize after cudaMemcpy calls and before kernel launches?
(I don’t think so, which is why I’m asking ;))
There is definitely NO out-of-bounds access, this is double-checked … so is it a driver bug?

(My spec: latest drivers, 64-bit app, Windows Vista x64 SP1, 2 x GeForce GTX 280)

Another case is subsequent kernel launches on the same memory:

Kernel_Iterate<<<…>>>(SomeMemory);
Kernel_Iterate<<<…>>>(SomeMemory);
Kernel_Iterate<<<…>>>(SomeMemory);
Kernel_Iterate<<<…>>>(SomeMemory);

Do I need to call cudaThreadSynchronize between the kernels, or are they queued on the driver side and then executed IN ORDER?

thanks for clarifications.

No, you shouldn’t need cudaThreadSynchronize() in any of those cases. The driver will execute these operations in order.
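To illustrate the in-order behaviour, here is a minimal sketch (not the original poster's code). The kernel body, the `int` buffer, the launch configuration, and the helper name `run_iterations` are all assumptions made up for the example. It copies data to the device, launches the same kernel several times back to back with no synchronize call in between, and copies the result back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element of the buffer.
__global__ void Kernel_Iterate(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: returns the last element after `launches`
// back-to-back kernel launches, or -1 if any CUDA call fails.
int run_iterations(int n, int launches)
{
    std::vector<int> host(n, 0);
    size_t bytes = n * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;

    // Blocking host-to-device copy, followed directly by the launches.
    cudaError_t err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        // No synchronize between launches: they are issued to the same
        // (default) stream, so each one runs after the previous one.
        for (int k = 0; k < launches; ++k)
            Kernel_Iterate<<<blocks, threads>>>(dev, n);

        // This copy is ordered after the kernels above and blocks the host
        // until the data has arrived.
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return host[n - 1];
}
```

Calling `run_iterations(1000, 4)` should give 4 if the four launches really ran one after another on the copied data. This only covers work issued to a single stream; work spread over several streams would need explicit ordering.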

I think there is an environment variable that can make calls “async”.

Check out the manual and see if that is set. If that is set, un-set it now.

I might be wrong though… btw, I agree with what seibert has said.

I’m aware that a kernel launch is an async operation ;)
Thanks for the clarification - so it is a bug (in my code or in the driver). Now I can start digging :)
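For the digging part, one thing that may help: because a launch is asynchronous, an error raised inside the kernel is usually reported by some later CUDA call, not by the launch itself. A rough sketch of checking both stages is below. The helper names `check` and `launch_checked`, the kernel body, and the block size are hypothetical, and `cudaDeviceSynchronize()` is the current name of the call (`cudaThreadSynchronize()` is its older, now deprecated, equivalent).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the error string and returns false on failure.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical kernel standing in for the real one.
__global__ void Kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Launches the kernel and reports launch-time and run-time errors separately.
bool launch_checked(float *dev, int n)
{
    Kernel<<<(n + 255) / 256, 256>>>(dev, n);

    // Catches errors in the launch itself (bad configuration, etc.).
    if (!check(cudaGetLastError(), "kernel launch")) return false;

    // Debug-only: wait for the kernel so that a fault inside it is
    // reported here instead of at some unrelated later call.
    return check(cudaDeviceSynchronize(), "kernel execution");
}
```

The synchronize here is only a debugging aid to localize the failure; it is not needed for correct ordering, and it can be removed once the bug is found.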