TeraGrid SU

Hello Ladies and Gentlemen, I was wondering if anyone familiar with TeraGrid could suggest an SU (service unit) conversion value for a GPU-hour of work in a cluster environment?

I know there are lots of details that’d have to be specified, but I was just wondering if someone with experience might have a ballpark estimate.

Thanks!

For molecular dynamics ( http://www.ameslab.gov/hoomd ) a single Tesla GPU performs about as well as ~30 CPU cores in a fast cluster. Most speedups I’ve seen reported are in a similar ballpark. Algorithms that aren’t particularly well adapted to the data-parallel architecture see smaller speedups, while others see larger ones.

This thread: http://forums.nvidia.com/index.php?showtopic=48836 has some more examples, though most are comparing single/multi-core CPUs to a single GPU and not clusters.

If it were me, I wouldn’t necessarily set the SU cost based on the GPU’s speedup relative to the other cores in your cluster. What you really have is a capacity planning problem; you want your GPUs to be used with roughly the same duty cycle as the rest of your cluster. All other things being equal, that would suggest that a GPU-hour should be valued relative to a node-hour by a factor equal to the speedup, but there are several factors that would change that. For instance, if you have fewer GPUs than nodes, that would make GPU-hours relatively more valuable. Offsetting that, the relative newness of GPU programming will probably make some users reluctant to try it, which would tend to make GPU-hours relatively less valuable. My guess is that for now users’ resistance to change will dominate, meaning GPUs should cost less than the speedup ratio, at least until GPU programming is widely accepted amongst users.
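To make that concrete, here is a rough sketch of how a starting price could be worked out. It assumes one SU corresponds to one CPU core-hour (check your own site's definition), and both the function name `gpu_hour_su` and the `adjustment` factor are made up for illustration: values below 1 stand in for users' reluctance to adopt GPUs, values above 1 for GPU scarcity.

```python
def gpu_hour_su(speedup, su_per_core_hour=1.0, adjustment=1.0):
    """Starting guess for the SU cost of one GPU-hour.

    speedup          -- how many CPU cores one GPU is roughly worth
    su_per_core_hour -- SUs charged per CPU core-hour (assumed to be 1.0)
    adjustment       -- hypothetical fudge factor: < 1 for low user
                        adoption, > 1 if GPUs are scarce
    """
    if speedup <= 0 or su_per_core_hour <= 0 or adjustment <= 0:
        raise ValueError("all inputs must be positive")
    return speedup * su_per_core_hour * adjustment


if __name__ == "__main__":
    # A GPU worth ~30 cores, priced at half the speedup ratio
    # to encourage early adopters.
    print(gpu_hour_su(30, adjustment=0.5))
```

The 0.5 here is only a placeholder, not a recommendation; the right value depends on your own users and hardware mix.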

If your users will tolerate tinkering with the SU cost structure, I would recommend starting with a reasonable guess and seeing what the queue for jobs requesting GPU resources looks like. Then adjust the cost over time until the utilization for the GPUs is approximately the same as the rest of the system.
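One way to picture that adjustment is as a simple proportional update applied at each accounting period. This is only a sketch: the function name `adjust_gpu_cost`, the `gain` value, and the price floor are assumptions for illustration, and utilizations are taken to be fractions between 0 and 1 reported by your scheduler.

```python
def adjust_gpu_cost(cost, gpu_util, cluster_util, gain=0.5, min_cost=1.0):
    """Return an updated SU cost per GPU-hour.

    Raises the cost when the GPUs are busier than the rest of the
    cluster and lowers it when they are idler. `gain` (hypothetical)
    controls how aggressively the price reacts; `min_cost` keeps the
    price from dropping to nothing.
    """
    for u in (gpu_util, cluster_util):
        if not 0.0 <= u <= 1.0:
            raise ValueError("utilization must be a fraction in [0, 1]")
    error = gpu_util - cluster_util          # positive: GPUs over-subscribed
    new_cost = cost * (1.0 + gain * error)   # proportional correction
    return max(min_cost, new_cost)


if __name__ == "__main__":
    cost = 20.0
    # Made-up utilization history: (GPU utilization, cluster utilization)
    for gpu_util, cluster_util in [(0.3, 0.7), (0.5, 0.7), (0.65, 0.7)]:
        cost = adjust_gpu_cost(cost, gpu_util, cluster_util)
        print(round(cost, 2))
```

A small gain changes the price slowly, which is probably kinder to users who have already budgeted their allocations.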

[Edited to correct a small logic error in the original]

-rpl