More cores than GPUs

Could someone give me any ideas about how bad it is to have many CPUs independently sending requests to one GPU?

I saw a thread around here where someone mentioned it was not recommended, but another guy responded and said his less demanding application worked fine in this situation.

Is this something that becomes very significant very quickly? I’m assuming we’ll need to get our memory coalescing/shared memory problems out of the way before we even worry about the cost of these program switches on the GPU, or is that a bad assumption?

Does this work at 2 cores/GPU where it doesn’t work at 8 cores/GPU?


Check out MisterAnderson42’s GPU worker.

It depends on whether the CPU cores are doing actual computation or just dispatching kernels. For the latter, you probably won’t see much benefit (and maybe even some degradation) over 1 CPU core to 1 GPU.

Yeah, that was probably me. I’ve since learned that, at least for now, you should try not to have separate processes using the same GPU, or you might trip over some driver bugs that should be fixed in a future CUDA release. However, you could easily have one thread controlling the GPU and several CPU threads handing their GPU work to that controlling thread.
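For what it’s worth, here is a rough sketch of that "one thread owns the GPU" layout. It is plain Python with no real GPU calls: `fake_gpu_kernel`, `GpuWorker` and `run_demo` are made-up names for illustration, and the "kernel" just squares a number. Treat it as a picture of the queueing pattern only, not as how any particular library does it.

```python
import queue
import threading
from concurrent.futures import Future


def fake_gpu_kernel(x):
    # Hypothetical stand-in for a real kernel launch; it just squares x.
    return x * x


class GpuWorker:
    """One thread owns the (pretend) GPU; other threads submit work to it."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Only this thread would ever touch the GPU in a real program.
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            func, args, future = item
            try:
                future.set_result(func(*args))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, func, *args):
        # Called from any CPU thread; returns a Future to wait on.
        future = Future()
        self._tasks.put((func, args, future))
        return future

    def shutdown(self):
        self._tasks.put(None)
        self._thread.join()


def run_demo(n_cpu_threads=4):
    worker = GpuWorker()
    results = [None] * n_cpu_threads

    def cpu_thread(i):
        # Each CPU thread asks the worker to run the "kernel" and waits.
        results[i] = worker.submit(fake_gpu_kernel, i).result()

    threads = [threading.Thread(target=cpu_thread, args=(i,))
               for i in range(n_cpu_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    worker.shutdown()
    return results
```

The requests are served one at a time in the order they reach the queue, so the CPU threads never contend for the device directly.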

Hi, you’ll likely benefit from reading Jim’s presentations on this topic as it pertains to our code NAMD:…PU-Phillips.pdf

Tach: On Page 17, would it be true to say that the black bars overall have more GPUs than the dark or light grey ones (while all bars have the same number of cores running – I suppose you just left some cores on each node idle)?

Also (and this may be difficult for you to even guess at), how intensive are your GPU calls? I mean intensive in the sense of a matrix multiplication being intense, so that if two CPUs were requesting MMs on an ideal GPU you’d get 2x longer runtime (and any degradation past that would be an effect of sharing the GPU). At some point of lower intensity, you’d not get a performance drop because the GPU would have plenty of free time. The paper says you guys have some conditionals that break up the multiprocessors a bit. On a scale of 0 to 10, what would you say your kernels are like?
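To make the "intensity" idea concrete, here is a back-of-the-envelope model. It assumes an ideal GPU with perfect scheduling and zero switching cost, and that each core spends a fixed fraction of its step waiting on GPU work. The function name `ideal_slowdown` and the parameter `gpu_fraction` are just made up for this sketch; real sharing overhead would come on top of whatever it predicts.

```python
def ideal_slowdown(n_cpus, gpu_fraction):
    """Idealized slowdown when n_cpus share one GPU.

    gpu_fraction is the share of one core's step time spent in GPU work
    when it has the GPU to itself (1.0 = pure matrix-multiply style load).
    Assumes perfect overlap between cores and no switching overhead.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    demand = n_cpus * gpu_fraction  # total GPU time requested per step
    # Below full utilization the GPU has idle time to absorb the extra
    # requests; above it, every step stretches in proportion to demand.
    return max(1.0, demand)


if __name__ == "__main__":
    for n in (1, 2, 8):
        for frac in (0.1, 0.5, 1.0):
            print(n, frac, ideal_slowdown(n, frac))
```

In this model two cores with `gpu_fraction = 1.0` give the 2x case above, while eight cores at `gpu_fraction = 0.1` see no slowdown at all because the GPU is still idle part of the time.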

How big are your kernels (per call)? From the earlier slides it looks like the GPUs are running at about a tenth of a second per step, but is that just a few calls? Or is that a few hundred calls?

I think I’m asking worthwhile questions here… Tell me if I’m barking up the wrong tree (or tell me if I need to read the paper more closely :]).

Gatoat & Seibert: I fear this, but yeah, I see the argument for setting up this way. I’ll probably try overloading the GPUs first anyway :).