Which version of CUDA are you using? Do you invoke CUDA functions only from the pthreads thread?
Up to CUDA 3.2, only one host thread can access a given GPU context through the CUDA API.
In version 4.0, many host threads can share one GPU context. So if you are using a version earlier than CUDA 4.0 and you call, for example, cudaMalloc in the main host thread and then launch your kernel from a pthreads thread, you may run into problems.
Well, it’s 3.2. I do run commands from both the “host thread” and a single subthread, but not simultaneously. That’s only because the OpenGL stuff crashed the last time I ran it from the subthread… otherwise, I could run it all from the subthread.
At first, there was a big slowdown because it had to generate a new context: I was creating a new subthread each “cycle” in my software, and each new subthread generated a new context. That caused a slowdown of 4 seconds or more. I fixed this by using only one subthread, but it’s still running at half speed.
The program starts in the main thread and makes some cudaMalloc calls from there, then creates a subthread which does CUDA stuff while the host thread loops idle, and it still slows down. The main thread then runs some CUDA stuff, but not simultaneously with the subthread.
I wouldn’t mind switching to 4.0, but that’s still beta, isn’t it?
They’re definitely not both running on the Quadro.
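For reference, the single-subthread arrangement described above could look roughly like the following sketch in plain C with pthreads. All names here (worker_t, do_cycle, worker_main, run_cycles) are hypothetical, and do_cycle is an empty placeholder where the per-cycle CUDA calls would go; no CUDA API is used. The point is only that the thread is created once and reused, so any per-thread setup happens once instead of once per cycle.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state shared by the main thread and one long-lived worker. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;     /* signalled whenever a field below changes */
    bool has_job;                /* main -> worker: a cycle is ready */
    bool quit;                   /* main -> worker: shut down */
    int  cycles_done;            /* worker -> main: completed cycles */
} worker_t;

/* Placeholder for the per-cycle work (assumed: kernel launches, copies). */
static void do_cycle(void) { }

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    /* One-time per-thread setup would go here, once for the thread's life. */
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->has_job && !w->quit)
            pthread_cond_wait(&w->changed, &w->lock);
        if (!w->has_job)
            break;               /* quit requested and nothing pending */
        w->has_job = false;
        pthread_mutex_unlock(&w->lock);
        do_cycle();              /* run the cycle without holding the lock */
        pthread_mutex_lock(&w->lock);
        w->cycles_done++;
        pthread_cond_broadcast(&w->changed);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Runs n cycles on a single reused worker; returns the cycles completed. */
int run_cycles(int n)
{
    worker_t w;
    pthread_t tid;
    int rc, done;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.changed, NULL);
    w.has_job = false;
    w.quit = false;
    w.cycles_done = 0;

    rc = pthread_create(&tid, NULL, worker_main, &w);  /* created once */
    assert(rc == 0);
    (void)rc;

    pthread_mutex_lock(&w.lock);
    for (int i = 0; i < n; i++) {
        w.has_job = true;                       /* hand over one cycle */
        pthread_cond_broadcast(&w.changed);
        while (w.cycles_done <= i)              /* block until it is done */
            pthread_cond_wait(&w.changed, &w.lock);
    }
    w.quit = true;
    pthread_cond_broadcast(&w.changed);
    pthread_mutex_unlock(&w.lock);

    pthread_join(tid, NULL);
    done = w.cycles_done;
    pthread_mutex_destroy(&w.lock);
    pthread_cond_destroy(&w.changed);
    return done;
}
```

Calling run_cycles(5) should return 5. Build with -pthread. In this sketch the main thread blocks in pthread_cond_wait while a cycle runs, rather than looping.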
I moved it to OpenMP to see if it’d grab different cores, but I still need to put sleeps in the main thread to have the CUDA stuff run at full speed on the subthread.
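The sleep-in-the-wait-loop workaround mentioned above might look something like this in plain C. This is only a sketch: done_flag, sub_main and wait_with_sleep are hypothetical names, the subthread body is a placeholder for the CUDA work, and nanosleep assumes a POSIX system. One possible explanation for the effect, not confirmed here, is that a main thread polling without any pause competes with the subthread for CPU time.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical flag the subthread sets when its work is finished. */
static atomic_int done_flag;

static void *sub_main(void *arg)
{
    (void)arg;
    /* Placeholder: the CUDA work would run here. */
    atomic_store(&done_flag, 1);
    return NULL;
}

/* Starts the subthread, then polls done_flag, sleeping sleep_ns nanoseconds
   between checks instead of spinning. Returns the number of sleeps taken,
   or -1 if the thread could not be created. */
int wait_with_sleep(long sleep_ns)
{
    pthread_t tid;
    struct timespec pause;
    int polls = 0;

    assert(sleep_ns >= 0 && sleep_ns < 1000000000L);
    pause.tv_sec = 0;
    pause.tv_nsec = sleep_ns;

    atomic_store(&done_flag, 0);
    if (pthread_create(&tid, NULL, sub_main, NULL) != 0)
        return -1;

    while (!atomic_load(&done_flag)) {
        nanosleep(&pause, NULL);   /* give up the CPU between checks */
        polls++;
    }
    pthread_join(tid, NULL);
    return polls;
}
```

For example, wait_with_sleep(1000000) polls about once per millisecond. Build with -pthread.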
On the main thread, there is some CUDA code being run on the Tesla, though… just not simultaneously with the CUDA stuff being run on the subthread.
Is switching contexts between the two threads causing the slowdown? If that were it, why would sleeping the main thread make a difference?
However, it is a “hyperthreaded” processor or whatever, such that each core has “two threads” to make switching faster. Could that be it? OpenMP tells me there are 16 processors available (when there are really 8). However, when I remove the sleep statements, top shows me I have 200% CPU usage.
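A count of 16 on an 8-core machine is what one would expect if the number reported is logical processors, with two hardware threads per core. As a hedged way to cross-check outside OpenMP, the sketch below asks the OS directly. logical_cpu_count is a hypothetical helper name, and _SC_NPROCESSORS_ONLN is a common extension (available on Linux and macOS, for example) rather than something every POSIX system guarantees.

```c
#include <unistd.h>

/* Returns the number of logical processors currently online as reported
   by the OS, or -1 if the query fails. With two hardware threads per
   core this is expected to be twice the physical core count. */
long logical_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

By default, top reports a process's CPU usage relative to a single logical CPU, so 200% would be consistent with two threads each keeping one logical CPU fully busy.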