My CUDA host function has the following structure:
…
My_kernel <<< … >>> ( … );
cudaThreadSynchronize(); ---- (1)
…
… ---- (2)
…
…
return;
where (1) takes about 3 ms waiting for all threads to complete, and (2) is some processing that is independent of the kernel and takes about 2 ms.
I want to let the CPU run (2) while it is waiting for the CUDA threads, and synchronize after (2) is done. So I modified the code to:
…
My_kernel <<< … >>> ( … );
…
… ---- (2)
…
cudaThreadSynchronize(); ---- (1)
…
return;
In this setting, the kernel launch takes only negligible time, and (2) still takes 2 ms to do its work.
But (1), the cudaThreadSynchronize() call, still takes 3 ms waiting for all threads, and the overall time does not decrease at all.
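For reference, the way I measure these times is roughly the following (a simplified sketch, not my real code: the kernel body, launch configuration, and the work in (2) are placeholders, and the timing is just std::chrono wall-clock time):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void My_kernel() { /* placeholder for my real kernel (~3 ms) */ }

static double ms_since(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

void host_function()
{
    auto t0 = std::chrono::steady_clock::now();
    My_kernel<<<1, 1>>>();                  // placeholder launch configuration
    printf("launch : %.3f ms\n", ms_since(t0));   // negligible

    auto t1 = std::chrono::steady_clock::now();
    // ... (2) CPU work that is independent of the kernel, ~2 ms ...
    printf("(2)    : %.3f ms\n", ms_since(t1));

    auto t2 = std::chrono::steady_clock::now();
    cudaThreadSynchronize();                // (1) still measures ~3 ms here
    printf("sync   : %.3f ms\n", ms_since(t2));
}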
To test this, I replaced (2) with some dummy code like this:
My_kernel <<< … >>> ( … );
int i;
int dummy = 1;
for (i = 0 ; i < 10000000; i++)
dummy = dummy * 2 % 10; ---- (2)
dump_value[0] = dummy; // keep the compiler from optimizing the dummy loop away
cudaThreadSynchronize(); ---- (1)
While (2) now takes 100~1000 ms and is completely independent of (1), cudaThreadSynchronize() still takes 3 ms to do the synchronization. It seems that the CUDA threads do not actually run until cudaThreadSynchronize() is called.
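For completeness, a self-contained version of this test might look like the following (a sketch only: the kernel body and launch configuration are placeholders, and the cudaStreamQuery() probe is just an idea for checking whether the kernel has already finished before the synchronization):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void My_kernel() { /* placeholder for my real kernel (~3 ms) */ }

int dump_value[1];

void test_overlap()
{
    My_kernel<<<1, 1>>>();                  // placeholder launch configuration

    // (2) dummy CPU work, 100~1000 ms, completely independent of the kernel
    int dummy = 1;
    for (int i = 0; i < 10000000; i++)
        dummy = dummy * 2 % 10;
    dump_value[0] = dummy;                  // keep the loop from being optimized away

    // If the kernel really ran while the CPU was in the loop, it should have
    // finished long ago, and the default stream should already be idle here.
    cudaError_t s = cudaStreamQuery(0);
    printf("before sync: %s\n", s == cudaSuccess ? "kernel already finished" : "kernel not finished yet");

    cudaThreadSynchronize();                // (1) still takes ~3 ms for me
}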
Is there a way to let the CUDA threads run in parallel with the CPU process, and synchronize them after the CPU work is done? Or is this impossible in the current CUDA architecture?