Multiple CPU threads: performance hit

Without going into too much detail about my setup: is there any obvious reason why two operating-system (CPU) threads running CUDA kernels, performing memcpys, etc., would be more than four times slower than a single thread?

Both threads are doing the same work on different data (input from two cameras).

I expected the total time to scale roughly linearly with the number of threads.

A single thread completes its processing in 5 ms, whereas 2 threads complete their processing in 24 ms.

And yes - I’m calling cudaThreadSynchronize() to ensure accurate timing.
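For reference, each thread's timed step looks roughly like this (a sketch, not my real code; the kernel, buffer names, and launch configuration are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real per-frame processing kernel.
__global__ void processFrame(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Time one frame's worth of GPU work. cudaThreadSynchronize() (the
// CUDA 1.x API) blocks until all preceding GPU work has finished, so
// the elapsed time read afterwards covers the whole kernel execution,
// not just the asynchronous launch.
void timedStep(float *devData, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    processFrame<<<(n + 255) / 256, 256>>>(devData, n);
    cudaEventRecord(stop, 0);

    cudaThreadSynchronize();  // drain the GPU before reading the timer

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```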

Thanks for any light anyone may be able to bring to the problem! I’ll post more details if required.


CUDA 1.1
Visual Studio 2005
Geforce 8800 GTX
Intel Core 2 CPU 6700 @ 2.66GHz


Do you have 2 physical CPU cores? The problem is that GPU performance may degrade if another thread is consuming CPU cycles (and cudaThreadSynchronize() causes 100% CPU load while it waits).

Anyway, I’d suggest profiling both versions and comparing gputime and cputime.

Since you’re using the same GPU from two threads, there is a lot of context switching between the threads’ CUDA contexts. This might hurt a lot.

I tried adding a lock around the chunk of CUDA processing code, reducing the number of context switches to one per frame per thread, but this made no difference.

I thought I read somewhere that a context switch was a relatively quick operation?



Putting a lock around the CUDA processing code is an interesting idea, but since CUDA calls are asynchronous, they haven’t necessarily finished by the time the lock is released. Still, you’d expect some change in behaviour.

Do the CUDA profiler logs show the same kernel launch times for 1 and 2 threads?

Before releasing the lock I also call cudaThreadSynchronize(), so in conjunction with the lock that should serialize access to the GPU.
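Concretely, the locking pattern I tried looks like this (sketched with a Win32 critical section, since I’m on Visual Studio 2005; the processing body is a placeholder):

```cuda
#include <windows.h>
#include <cuda_runtime.h>

// Shared across both camera threads; initialized once at startup
// with InitializeCriticalSection(&gpuLock).
CRITICAL_SECTION gpuLock;

void processOneFrame(float *devData, int n)
{
    EnterCriticalSection(&gpuLock);

    // ... kernel launches and cudaMemcpy calls for this camera's frame ...

    // Kernel launches are asynchronous, so drain the GPU before
    // releasing the lock; otherwise the other thread's context could
    // be switched in while this thread's work is still pending.
    cudaThreadSynchronize();

    LeaveCriticalSection(&gpuLock);
}
```

With this, only one thread at a time holds the GPU, and the GPU is idle (from this thread’s point of view) when the lock is handed over, so there should be at most one context switch per frame.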

I have run the Visual Profiler over the code, and strangely there is a sporadic spiking pattern in the CPU overhead time when running with 2 threads (GPU time remains the same as with a single thread). With a single thread, the CPU overhead is almost entirely stable. (Out of interest, can anyone explain where the CPU overhead for a kernel launch comes from, and why it varies between kernels?)

I’ve been trying to work it through in my head, but I keep coming back to the fact that if both threads were completely serialized, the run should simply take twice as long; so where does the extra time come from when running both threads simultaneously?