Tach: On Page 17, would it be true to say that the black bars overall have more GPUs than the dark or light grey ones (while all bars have the same number of cores running – I suppose you just left some cores on each node idle)?
Also (and this may be difficult for you to even guess at): how intensive are your GPU calls? I mean intensive in the sense that a matrix multiplication is intense, so that if two CPUs were requesting MMs on an ideal GPU you'd get a 2x longer runtime (and any degradation past that would be an effect of sharing the GPU). Below some level of intensity you'd see no performance drop at all, because the GPU would have plenty of free time. The paper says you guys have some conditionals that break up the multiprocessors a bit. On a scale of 0 -> 10, where would you put your kernels?
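To make what I mean by "intensity" concrete, here's the kind of toy benchmark I have in mind. This is entirely my own sketch (the kernel and all its parameters are made up, nothing here is from your code): two streams stand in for two CPU ranks sharing one GPU, and the two-client/one-client time ratio tells you whether the kernel saturates the device (ratio near 2) or leaves it idle time (ratio near 1).

```cuda
// gpu_share_test.cu -- hypothetical micro-benchmark; compile with nvcc.
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately arithmetic-heavy kernel standing in for a "10 out of 10" call.
__global__ void busy_kernel(float *out, int iters) {
    float v = threadIdx.x * 1e-6f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 1e-7f;           // keep the multiprocessors busy
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Launch the kernel once per "client", each client in its own stream,
// mimicking N CPU ranks issuing calls to one GPU at the same time.
static float time_clients(int nclients, int iters) {
    const int blocks = 256, threads = 256;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaStream_t streams[8];
    for (int c = 0; c < nclients; ++c) cudaStreamCreate(&streams[c]);

    cudaEventRecord(start);
    for (int c = 0; c < nclients; ++c)
        busy_kernel<<<blocks, threads, 0, streams[c]>>>(d_out, iters);
    cudaEventRecord(stop);                    // default stream waits for all clients
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);

    for (int c = 0; c < nclients; ++c) cudaStreamDestroy(streams[c]);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main() {
    const int iters = 1 << 16;
    time_clients(1, iters);                   // warm-up
    float t1 = time_clients(1, iters);
    float t2 = time_clients(2, iters);
    // Saturating kernel: ratio -> 2. GPU with idle headroom: ratio -> 1.
    printf("1 client: %.2f ms, 2 clients: %.2f ms, ratio %.2f\n", t1, t2, t2 / t1);
    return 0;
}
```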
How big are your kernels (per call)? From the earlier slides it looks like the GPUs are running at about a tenth of a second per step, but is that just a few calls, or a few hundred calls?
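And here's roughly how I'd try to answer my own calls-per-step question with instrumentation, if I had your code in front of me. Again my own sketch (the kernel, problem size, and the 100-calls-per-step figure are all guesses, not from the paper): wrap each launch in CUDA events and see how a ~0.1 s step decomposes into individual calls.

```cuda
// kernel_call_census.cu -- hypothetical instrumentation sketch.
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; in real instrumentation this would be one of your own.
__global__ void step_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.999f + 0.001f;
}

int main() {
    const int n = 1 << 20;
    const int calls_per_step = 100;           // pure guess at your call count
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.0f;
    for (int c = 0; c < calls_per_step; ++c) {
        cudaEventRecord(start);
        step_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;                       // accumulate per-call GPU time
    }
    printf("%d calls, %.3f ms/call, %.1f ms/step total\n",
           calls_per_step, total_ms / calls_per_step, total_ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

A few hundred ~1 ms calls and a handful of ~30 ms calls would behave very differently when two ranks share the GPU, which is why I'm asking.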
I think I’m asking worthwhile questions here… Tell me if I’m barking up the wrong tree (or tell me if I need to read the paper more closely :]).
Gatoat & Seibert: I fear this, but yeah, I see the argument for setting it up this way. I'll probably try overloading the GPUs first anyway :).