Multiple CPU threads: performance hit

Without going into too much detail about my setup: is there any obvious reason why two operating-system (CPU) threads running CUDA kernels, performing memcpys, etc., would be more than four times slower than a single thread?

Both threads are doing the same work on different data (input from two cameras).

I expected the total time to scale roughly linearly with the number of threads.

A single thread completes its processing in 5 ms, whereas 2 threads complete their processing in 24 ms.

And yes - I’m calling cudaThreadSynchronize() to ensure accurate timing.
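For reference, each thread's timed step looks roughly like this (a sketch, not my real code; the kernel, buffer names, and launch configuration are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real per-frame processing kernel.
__global__ void processFrame(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Time one frame's worth of GPU work. cudaThreadSynchronize() (the
// CUDA 1.x API) blocks until all preceding GPU work has finished, so
// the elapsed time read afterwards covers the whole kernel execution,
// not just the asynchronous launch.
void timedStep(float *devData, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    processFrame<<<(n + 255) / 256, 256>>>(devData, n);
    cudaEventRecord(stop, 0);

    cudaThreadSynchronize();  // drain the GPU before reading the timer

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```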

Thanks for any light anyone may be able to bring to the problem! I’ll post more details if required.


CUDA 1.1
Visual Studio 2005
Geforce 8800 GTX
Intel Core 2 CPU 6700 @ 2.66GHz


Do you have 2 physical CPU cores? The problem is that GPU performance may degrade if another thread is consuming CPU cycles (and cudaThreadSynchronize() causes 100% CPU load while it waits).

Anyway, I’d suggest profiling both versions and comparing gputime and cputime.

Since you’re using the same GPU from two threads, there is a lot of context switching between the threads’ CUDA contexts. This might hurt a lot.

I tried adding a lock around the chunk of CUDA processing code, reducing the number of context switches to one per frame per thread, but this made no difference.

I thought I read somewhere that a context switch was a relatively quick operation?



Putting a lock around the CUDA processing code is an interesting idea, but since CUDA calls are asynchronous, they haven’t necessarily finished by the time the lock is released. Still, you’d expect some change in behaviour.

Do the CUDA profiler logs show the same kernel launch times for 1 and 2 threads?

Before releasing the lock I also call cudaThreadSynchronize(), so in conjunction with the lock that should serialize access to the GPU.
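Concretely, the locking pattern I tried looks like this (sketched with a Win32 critical section, since I’m on Visual Studio 2005; the processing body is a placeholder):

```cuda
#include <windows.h>
#include <cuda_runtime.h>

// Shared across both camera threads; initialized once at startup
// with InitializeCriticalSection(&gpuLock).
CRITICAL_SECTION gpuLock;

void processOneFrame(float *devData, int n)
{
    EnterCriticalSection(&gpuLock);

    // ... kernel launches and cudaMemcpy calls for this camera's frame ...

    // Kernel launches are asynchronous, so drain the GPU before
    // releasing the lock; otherwise the other thread's context could
    // be switched in while this thread's work is still pending.
    cudaThreadSynchronize();

    LeaveCriticalSection(&gpuLock);
}
```

With this, only one thread at a time holds the GPU, and the GPU is idle (from this thread’s point of view) when the lock is handed over, so there should be at most one context switch per frame.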

I have run the Visual Profiler over the code, and strangely there is a sporadic spiking pattern in the CPU overhead time when running with 2 threads (GPU time remains the same as with a single thread). With a single thread, the CPU overhead is almost entirely stable. (Out of interest, can anyone explain where the CPU overhead for a kernel launch comes from, and why it varies between kernels?)

I’ve been trying to work it through in my head, but I keep coming back to the fact that if both threads were completely serialized, the run should simply take twice as long; so where does the extra time come from when running both threads simultaneously?