Can a kernel function run in parallel with CPU code?

My CUDA host function has the following structure:

My_kernel <<< … >>> ( … );
cudaThreadSynchronize();   // (1)

…                          // (2)


where (1) takes about 3 ms to wait for all threads to complete, and (2) is some processing that is independent of the kernel and takes about 2 ms.
I want the CPU to run (2) while waiting for the CUDA threads, and to synchronize after (2) is done. So I modified the code to:

My_kernel <<< … >>> ( … );

…                          // (2)

cudaThreadSynchronize();   // (1)


In this setting, the kernel launch takes only negligible time, and (2) still takes 2 ms to do its work.
But (1), the cudaThreadSynchronize() call, still takes 3 ms to wait for all threads, and the overall time does not decrease at all.

To test this, I replaced (2) with some dummy code like this:

My_kernel <<< … >>> ( … );

int i;
int dummy = 1;
for (i = 0 ; i < 10000000; i++)
    dummy = dummy * 2 % 10;   // (2)

dump_value[0] = dummy;        // keeps the compiler from removing the dummy code

cudaThreadSynchronize();      // (1)

While (2) now takes 100~1000 ms and is completely independent of (1), cudaThreadSynchronize() still takes 3 ms to do the synchronization. It seems that the CUDA threads do not actually run until cudaThreadSynchronize() is called.

Is there a way to let the CUDA threads run in parallel with the CPU code, and synchronize them after the CPU is done? Or is it impossible on the current CUDA architecture?

The way you are running it is the right way to overlap CPU and GPU code…

Possibly the 3 ms delay is a fixed overhead of cudaThreadSynchronize() itself.

Consider increasing your kernel execution time… and check again…

If this does NOT work right, we are all in for a surprise…

How do you measure time?

I know that it is strange, but the kernel does not seem to run at all until I call cudaThreadSynchronize();

Currently my code looks like this:

My_kernel <<< … >>> ( … );

int i;
int dummy = 1;
for (i = 0 ; i < 20000000; i++)
    dummy = dummy * 2 % 10;

h_dump[0] = dummy;        // keeps the compiler from removing the dummy code

cudaThreadSynchronize();


I added some redundant code so that my kernel runs for 40 ms rather than the normal 3 ms, and the dummy code costs about 160 ms; both values are obviously greater than the synchronization overhead.

But after the CPU spends 160 ms completing the dummy code, the cudaThreadSynchronize() line still waits 40 ms for the launched threads. In other words, the CPU is not blocked by the launched kernel, but the kernel also does not run while the CPU is doing its work.

The time is measured with assembly code that reads the CPU tick counter:

#define ReadTSC(x) __asm cpuid \
                   __asm rdtsc \
                   __asm mov dword ptr x, eax \
                   __asm mov dword ptr x+4, edx

so I can compute the time cost like:




printf("Launch kernel : %.4f (ms)", float(end_tick - start_tick) / 2400000); // 2.4GHz CPU

I am sure that this timing code does not affect the run time, since the overall time of the host function is still the same when I comment it all out. My problem is that no matter where I put cudaThreadSynchronize(), it takes the same time to wait, and I get no gain from the overlap.

Oops… Thanks for the experiments. This is just UNINTENDED!

I remember seeing the same thing before. I tried some CPU overlap to increase speed-ups and found that nothing really happened. I just thought the CPU was finishing too fast for it to be noticed… Now, what you say may be true.

Which CUDA version are you using?

You should consider filing a bug report!

See the sticky entries in the forum that tell you how to write a bug report. The best thing would be to become a registered developer and file bug reports.

Best Regards

What is your CPU? If you’re running on a multicore machine, you might have some problems with this code. I’d suggest you use a documented way to measure time. On Windows it is QueryPerformanceFrequency() and QueryPerformanceCounter(); for Linux I don’t know. Or you can limit the thread affinity mask to a single core. You may also take a look at the CUDA Events API.

gettimeofday is the best on Linux.

It is an Intel Core2Quad Q6600 2.40GHz, but I only run my function in a single thread, even for the CPU code, so I think the timing is correct.

My CUDA version is 2.0 and the graphics card is an NVIDIA GeForce 9800 GTX+.

I also tried another way to synchronize the threads:

cudaEvent_t evt;
cudaEventCreate(&evt);

My_kernel<<< … >>>( … );

cudaEventRecord(evt, NULL);
while (cudaEventQuery(evt) == cudaErrorNotReady)
    …   // sleep between polls

but it still doesn’t work: even if I place the while-sleep loop after the CPU code, it still takes the same time to wait.

It may really be a bug, because my code DOES RUN IN PARALLEL on my colleague’s computer…

The differences between us:

Mine:
CPU : Intel Core2Quad Q6600 2.40GHz
Graphics : NVIDIA GeForce 9800 GTX+
OS : Windows Vista 32 bits

My colleague’s:
CPU : Intel Core2Duo E8400 3.00GHz
Graphics : NVIDIA GeForce 9600
OS : Windows XP SP3

Both of our CUDA drivers are updated to 180.48 (of course, one for Windows Vista 32-bit and one for XP).

Maybe it is a bug that occurs on Vista, or somewhere else?

This looks to be a bug!


Can you file a problem report and create a new thread? That will get the NVIDIA moderators’ attention. It worked for me before.

I would say it is a very big issue…

But the master-slave multi-GPU setup that Mr.anderson used will help to get around this… One should use a “callAsync” on a “cudaThreadSynchronize()”, and that would work around this problem.
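
A rough sketch of the worker-thread idea, in plain C++ so it compiles without the CUDA toolkit. This is not Mr.anderson's actual code: run_overlapped, gpu_job and cpu_job are hypothetical names, and std::async merely stands in for whatever "callAsync" does. The assumption is that, in a real program, gpu_job would both launch the kernel and call cudaThreadSynchronize() on the worker thread, since as far as I know CUDA releases of this era tie the device context to the host thread that created it.

```cpp
#include <future>
#include <utility>

// Hypothetical helper: runs gpu_job on a worker thread while the calling
// thread runs cpu_job, then waits for the worker to finish.
// In a real CUDA program gpu_job would contain the kernel launch followed by
// cudaThreadSynchronize(); here it is just an arbitrary callable.
template <typename GpuJob, typename CpuJob>
void run_overlapped(GpuJob&& gpu_job, CpuJob&& cpu_job) {
    // std::launch::async forces a separate thread instead of deferred execution.
    auto gpu_done = std::async(std::launch::async, std::forward<GpuJob>(gpu_job));

    // CPU-side work proceeds while the worker thread blocks on the device.
    cpu_job();

    // Join point: plays the role of the final synchronize in the original code.
    gpu_done.get();
}
```

Whether this actually sidesteps the Vista behaviour described in this thread has not been verified here; it only shows the shape of the approach.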

I just tested the code on another friend’s computer. His graphics card is identical to mine, but his OS is Windows XP, and the code also runs in parallel as normal…

Can you tell me where to report a bug? I have read the sticky topic about writing bug reports.

I will also post this problem to the “CUDA on Vista” forum.

Whether you use one thread or two is irrelevant: rdtsc reads the counter of the current core, which on some CPUs can differ from the counter of the core on which your thread previously ran, and on a true multi-CPU system they will differ.

And particularly Windows likes to move threads to other cores for no good reason at all.

You really need to know very well what you are doing when using rdtsc, which is why I’d usually advise against it (also the other functions give you values you can easily display as/convert to seconds).
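
As an illustration of timing without rdtsc, here is a minimal sketch using the C++ standard library's monotonic clock rather than the Windows or Linux calls named above. elapsed_ms and dummy_work are hypothetical helper names; dummy_work just repeats the dummy loop from the earlier post.

```cpp
#include <chrono>

// Hypothetical helper: wall-clock milliseconds spent in a callable, measured
// with std::chrono::steady_clock, which is monotonic and reports real time
// units, so no manual division by the CPU frequency is needed.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// The dummy CPU loop from the thread. Returning the result gives the caller
// something to store, so the computation is not trivially dead code.
int dummy_work(int iterations) {
    int dummy = 1;
    for (int i = 0; i < iterations; ++i)
        dummy = dummy * 2 % 10;
    return dummy;
}
```

For example, elapsed_ms([&] { h_dump[0] = dummy_work(20000000); }) would time the CPU loop on its own, and wrapping the final cudaThreadSynchronize() call the same way would time the wait. An optimizing compiler may still shorten the loop, so the absolute numbers should be treated with care.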

I just tested on another computer with Windows Vista, and the problem remains, but I also found an easy workaround for it:

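My_Kernel <<< … >>> ( … );

cudaEvent_t evt;
cudaEventCreate(&evt);
cudaEventRecord(evt, NULL);   // forces the CUDA threads to start

…                             // let the CPU do its work before the threads complete

cudaThreadSynchronize();
cudaEventDestroy(evt);        // must come AFTER the CPU code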

CUDA events are often used for non-busy waiting, but I found that on Windows Vista cudaEventRecord() makes the threads start running when they otherwise would not. One important but strange detail: you must put cudaEventDestroy(evt) AFTER your CPU code, or the threads will not run in parallel with the CPU and the workaround does not work.