Question about cudaThreadSynchronize()

Here is my problem:

I need to launch two different kernels one after the other. Are the two cases below equivalent?

case1:

kernel1<<<grid, block>>>(…);
cudaThreadSynchronize();
kernel2<<<grid, block>>>(…);
cudaThreadSynchronize();

case2:
kernel1<<<grid, block>>>(…);
kernel2<<<grid, block>>>(…);
cudaThreadSynchronize();

Is there any benefit to omitting one of the cudaThreadSynchronize() calls?

Thanks :)

I can only answer this for CUDA 1.0:
Yes, the result is the same. Kernel calls are queued by the driver and executed sequentially, in the order they were issued. No kernels run in parallel, so kernel2 never starts before kernel1 has finished. There is no benefit from omitting one of the calls.
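To illustrate the ordering described above, here is a minimal sketch in which kernel2 depends on the output of kernel1 and only one synchronization is issued at the end (case 2). The kernel bodies, the buffer name d_data, and the sizes are hypothetical and chosen only for illustration. The sketch calls cudaDeviceSynchronize(), which replaces the now-deprecated cudaThreadSynchronize() on current toolkits; on CUDA 1.x you would use cudaThreadSynchronize() instead.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernels for illustration: kernel2 reads what kernel1 wrote.
__global__ void kernel1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void kernel2(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2 + 1;
}

int main() {
    const int n = 1 << 16;
    int *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(int));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Case 2: both launches are queued back to back, one sync at the end.
    kernel1<<<grid, block>>>(d_data, n);
    kernel2<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();  // cudaThreadSynchronize() on old toolkits

    // Copy back and count elements that do not match kernel1-then-kernel2.
    std::vector<int> h_data(n);
    cudaMemcpy(h_data.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_data[i] != 2 * i + 1) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_data);
    return mismatches == 0 ? 0 : 1;
}
```

If the launches execute in issue order as described, the program should report no mismatches even though there is no synchronization between the two kernels.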

The same is true for CUDA 1.1. If you do use the stream API in 1.1, kernel calls and memory copies issued to the same stream are queued up and executed in order. The non-streamed API in 1.1 looks and behaves just like 1.0, so it is completely backward compatible.
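As a rough sketch of the stream behaviour mentioned above, the following queues a copy, a kernel, and a copy back in a single stream and waits once at the end. The kernel scale, the buffer names, and the sizes are assumptions made up for this example; the runtime calls (cudaStreamCreate, cudaMemcpyAsync, cudaStreamSynchronize) are the standard stream API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies every element by factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory is needed for copies to be truly asynchronous.
    float *h_data = nullptr;
    float *d_data = nullptr;
    cudaMallocHost((void **)&h_data, bytes);
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // All three operations go into the same stream, so they run in order.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<grid, block, 0, stream>>>(d_data, n, 2.0f);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // The host only waits here, after everything has been queued.
    cudaStreamSynchronize(stream);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```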

Paulius

Setting up a kernel invocation isn’t free for the CPU, so case1 loses some benefit of concurrent CPU/GPU execution. If the kernels are doing a small enough amount of work that the driver overhead of the kernel invocations is noticeable, case2 is preferable because the CPU can set up the call to kernel2 while kernel1 is executing.
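One way to check whether the launch overhead matters for your kernels is to time both cases on the host. The sketch below is only an illustration: the two small kernels, the iteration count, and the buffer size are invented for the test, and the numbers will depend on your GPU, driver, and toolkit. It uses cudaDeviceSynchronize(), the current replacement for cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, deliberately small kernels so launch overhead is visible.
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

// Milliseconds elapsed on the host since start.
static double elapsed_ms(std::chrono::steady_clock::time_point start) {
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    const int n = 1024;
    const int iterations = 1000;
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch so one-time initialization is not timed.
    kernel1<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    // Case 1: synchronize after every kernel.
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        kernel1<<<grid, block>>>(d, n);
        cudaDeviceSynchronize();
        kernel2<<<grid, block>>>(d, n);
        cudaDeviceSynchronize();
    }
    double case1_ms = elapsed_ms(start);

    // Case 2: queue both kernels, then synchronize once.
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        kernel1<<<grid, block>>>(d, n);
        kernel2<<<grid, block>>>(d, n);
        cudaDeviceSynchronize();
    }
    double case2_ms = elapsed_ms(start);

    printf("case1: %.3f ms, case2: %.3f ms\n", case1_ms, case2_ms);

    cudaFree(d);
    return 0;
}
```

With kernels that do much more work per launch, the gap between the two timings would be expected to shrink, since the launch setup becomes a smaller fraction of the total time.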