CUDA slows down using threads

I’m using a Tesla C2050 and a Quadro FX5800.

For some odd reason, when I run my code on a separate thread instead of the main thread (using Linux pthreads), my CUDA code slows down by a factor of 2.

Any hints as to why this would be and/or how to resolve this issue?


Which version of CUDA are you using? Do you call CUDA functions only from the pthreads thread?

Up to CUDA 3.2, a GPU context can only be accessed through the CUDA API by one host thread.

In version 4.0 it is possible to run many host threads that use one GPU context. So if you are using a version earlier than CUDA 4.0 and you call, for example, cudaMalloc in the host main thread and then execute your kernel in a pthreads thread, you may run into problems.

Well, it’s 3.2. I do run commands from both the “host thread” and a single subthread, but not simultaneously. That’s only because the OpenGL stuff crashed the last time I ran it from the subthread… otherwise, I could run it all from the subthread.

At first, there was a big slowdown from having to generate a new context, because I was creating a new subthread each “cycle” in my software… thus generating a new context. That caused a slowdown of 4 or more seconds. I fixed this by having it use only one subthread. But it’s still half speed.
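
The single-subthread fix described above can be sketched roughly as follows. This is only a sketch under assumptions: `worker_t`, `run_cycle`, `submit_cycle`, `wait_for` and `run_cycles` are made-up names, and `run_cycle` is an empty placeholder for wherever the real CUDA calls would go. The idea is that the thread is created once and woken for each cycle, so whatever context it owns would be created once rather than once per cycle.

```cpp
#include <pthread.h>

// Hypothetical state shared by the main thread and one long-lived worker.
struct worker_t {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int  pending;    // cycles requested but not yet started
    int  completed;  // cycles the worker has finished
    bool quit;       // set to ask the worker to exit
};

// Placeholder for the per-cycle GPU work (kernel launches, copies, ...).
static void run_cycle(int /*cycle*/) {}

static void *worker_main(void *arg) {
    worker_t *w = static_cast<worker_t *>(arg);
    // One-time CUDA setup would go here, so it runs once, not once per cycle.
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (w->pending == 0 && !w->quit)
            pthread_cond_wait(&w->cond, &w->lock);  // blocks without spinning
        if (w->pending == 0) break;                 // quit requested, nothing left
        w->pending--;
        int cycle = w->completed;
        pthread_mutex_unlock(&w->lock);
        run_cycle(cycle);                           // work runs outside the lock
        pthread_mutex_lock(&w->lock);
        w->completed++;
        pthread_cond_broadcast(&w->cond);           // wake anyone in wait_for()
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

// Ask the worker to run one more cycle; returns immediately.
static void submit_cycle(worker_t *w) {
    pthread_mutex_lock(&w->lock);
    w->pending++;
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

// Block the caller until at least n cycles have completed.
static void wait_for(worker_t *w, int n) {
    pthread_mutex_lock(&w->lock);
    while (w->completed < n)
        pthread_cond_wait(&w->cond, &w->lock);
    pthread_mutex_unlock(&w->lock);
}

// Runs n cycles on a single worker thread; returns the number completed,
// or -1 if the thread could not be created.
static int run_cycles(int n) {
    worker_t w;
    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.cond, NULL);
    w.pending = 0;
    w.completed = 0;
    w.quit = false;

    pthread_t tid;
    if (pthread_create(&tid, NULL, worker_main, &w) != 0) return -1;

    for (int i = 0; i < n; i++) submit_cycle(&w);
    wait_for(&w, n);

    pthread_mutex_lock(&w.lock);
    w.quit = true;
    pthread_cond_broadcast(&w.cond);
    pthread_mutex_unlock(&w.lock);
    pthread_join(tid, NULL);

    int done = w.completed;
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.lock);
    return done;
}
```

Calling `run_cycles(5)` would create the thread once and push five cycles through it. In a real program the main thread would probably call `submit_cycle` from its own loop and only call `wait_for` when it actually needs the results.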

The program starts in the main thread, makes some cudaMalloc calls from the main thread, then creates a subthread which does CUDA stuff while the host thread loops idle, and it still slows down. The main thread then runs some CUDA stuff, but not simultaneously with the subthread.

I wouldn’t mind switching to 4.0, but that’s still in beta, isn’t it?

You cannot allocate memory from the main thread and then use that memory from a pthread. Up to version 3.2, device memory addresses cannot be shared between host threads, because each thread works in its own context. So if, say, cudaMalloc returns 0xDEADBEAF in the main thread, that address is meaningless in a pthread.


If you are doing that in CUDA 3.2 without using the context migration API, you still have two contexts…

Ah, right. That stuff works just fine. It’s really mallocking some host memory stuff outside the pthread, and that works.

Anyway, if I sleep the main thread the subthread goes back to full speed… the main thread was running some OpenGL/GLUT stuff on the Quadro.

I guess I have to use OpenMP to have the threads running on different processor cores.

Right, so that probably means your compute code and rendering code are both running on the Quadro.

That will make no difference. pthreads will run on different cores if they are available without any further intervention.

They’re definitely not both running on the Quadro.

I moved it to OpenMP to see if it’d grab different cores, but I still need to put sleeps in the main thread to have the CUDA stuff run at full speed on the subthread.

On the main thread, there is some CUDA code being run on the Tesla, though… just not simultaneously with the CUDA stuff being run on the subthread.

Is switching contexts between the two threads causing the slowdown? If that were it, why would sleeping the main thread make a difference?

However, it is a “hyperthreaded” processor or whatever, such that each core has “two threads” to make switching faster. Could that be it? OpenMP tells me there are 16 processors available (when there are really 8). However, when I remove the sleep statements, top shows me I have 200% CPU usage.
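
One way to test the hyperthreading theory would be to pin each thread to a chosen logical CPU and see whether the slowdown changes. Below is a rough sketch, assuming Linux with glibc (the `_np` call and `sched_getcpu` are GNU extensions, not portable pthreads). The names `report_cpu`, `first_allowed_cpu` and `run_pinned` are made up for illustration, and the thread body only records where it ran; the real CUDA work would replace it.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for the affinity calls and sched_getcpu()
#endif
#include <pthread.h>
#include <sched.h>

// Thread body: record which logical CPU this thread is running on.
static void *report_cpu(void *arg) {
    *static_cast<int *>(arg) = sched_getcpu();
    return NULL;
}

// Returns the lowest-numbered logical CPU this process may run on, or -1.
static int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}

// Starts a thread restricted to logical CPU `cpu` and returns the CPU the
// thread reported running on, or -1 if the thread could not be set up.
static int run_pinned(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }

    pthread_t tid;
    int seen = -1;
    int rc = pthread_create(&tid, &attr, report_cpu, &seen);
    pthread_attr_destroy(&attr);
    if (rc != 0) return -1;

    pthread_join(tid, NULL);
    return seen;
}
```

Which logical CPU numbers share a physical core is machine-specific. On Linux, `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list` (and the matching files for the other CPUs) lists the siblings, so the compute thread and the rendering thread could be pinned to CPUs that do not appear in the same list.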