Hello,
If a user has multiple different GPUs, which performance criteria should I use to share tasks between them in the most efficient way?
Thanks
This may be of interest. It mostly assumes multiple GPUs of the same type. Sharing a single task/job across multiple dissimilar GPUs is probably not a topic that receives much exploration.
You would have to find out which main limitation the algorithm has (e.g. memory capacity, bandwidth, or compute) and then split the work accordingly. You could also program automated measurements to optimize the task distribution, if the tasks are repeated.
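For illustration, here is a minimal sketch of such a proportional split. The `weights` values are placeholders for whatever measurements you end up with, and `split_work` is a made-up helper name, not an existing API:

```cpp
// Sketch: split N work items across GPUs in proportion to per-GPU weights.
// The weights (e.g. measured throughput) are assumed to come from elsewhere;
// the values below are placeholders.
#include <cstdio>
#include <vector>

std::vector<size_t> split_work(size_t n_items, const std::vector<double>& weights) {
    double total = 0.0;
    for (double w : weights) total += w;

    std::vector<size_t> chunks(weights.size());
    size_t assigned = 0;
    for (size_t i = 0; i < weights.size(); ++i) {
        chunks[i] = static_cast<size_t>(n_items * (weights[i] / total));
        assigned += chunks[i];
    }
    chunks.back() += n_items - assigned;  // give any rounding remainder to the last GPU
    return chunks;
}

int main() {
    // Hypothetical relative throughputs for two dissimilar GPUs.
    std::vector<double> weights = {1.0, 2.5};
    std::vector<size_t> chunks = split_work(1000000, weights);
    for (size_t i = 0; i < chunks.size(); ++i)
        printf("GPU %zu gets %zu items\n", i, chunks[i]);
    return 0;
}
```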
Sharing a task across multiple different GPUs is something I would avoid, because it gives rise to gnarly management problems.
An easier usage model would be to assign different tasks from shared queue(s) to these GPUs, taking into account their different capabilities, with each GPU working on one particular task at a time. This would be akin to the way Folding@Home uses multiple GPUs, or how a load sharing and balancing software like LSF parcels out work to different machines.
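A minimal sketch of that usage model, with one host thread per GPU pulling from a shared queue, could look like the following. The `Task` struct and `process_task()` are hypothetical placeholders for whatever kernels you actually launch per task:

```cpp
// Sketch: one host thread per GPU pulls tasks from a shared queue, so a faster
// GPU naturally ends up processing more tasks than a slower one.
#include <cuda_runtime.h>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task { int id; };          // placeholder task description

std::queue<Task> task_queue;
std::mutex queue_mutex;

bool pop_task(Task& t) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    if (task_queue.empty()) return false;
    t = task_queue.front();
    task_queue.pop();
    return true;
}

void process_task(const Task& t) {
    // Placeholder: launch your kernels for this task on the current device.
}

void worker(int device) {
    cudaSetDevice(device);         // all work in this thread targets this GPU
    Task t;
    while (pop_task(t)) {
        process_task(t);
        cudaDeviceSynchronize();   // finish this task before taking the next one
    }
}

int main() {
    for (int i = 0; i < 100; ++i) task_queue.push({i});

    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    std::vector<std::thread> workers;
    for (int d = 0; d < device_count; ++d) workers.emplace_back(worker, d);
    for (auto& w : workers) w.join();
    printf("all tasks done\n");
    return 0;
}
```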
But what about splitting the work in proportion to the number of SMs of each GPU?
I imagine you could do that. The ratio of the number of SMs is to some degree correlated with performance, but the correlation might not hold in every case. Differences in architecture also matter: for example, an Ampere-class GPU with 40 SMs might be more performant than a Pascal-class GPU with 60 SMs. Furthermore, other factors, such as tensor core throughput or memory bandwidth, might be more accurate predictors of delivered performance.
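As a rough illustration, the SM counts can be read from the runtime and turned into relative shares like this (just a sketch; as noted, the SM ratio is only a coarse proxy across architectures):

```cpp
// Sketch: derive per-GPU weights purely from the SM count reported by the runtime.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    std::vector<int> sms(n);
    int total_sms = 0;
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        sms[d] = prop.multiProcessorCount;
        total_sms += sms[d];
        printf("GPU %d (%s): %d SMs, compute capability %d.%d\n",
               d, prop.name, sms[d], prop.major, prop.minor);
    }
    for (int d = 0; d < n; ++d)
        printf("GPU %d share of work by SM ratio: %.2f\n",
               d, (double)sms[d] / total_sms);
    return 0;
}
```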
The question comes up from time to time as to whether GPU performance can be distilled down to a single number or metric that is useful for comparisons. I would say it's hard to do in the general case (i.e. with no constraints and no information about the use case). On the other hand, people sometimes do this: for example, 3DMark is a benchmark that attempts to rank GPUs (or, more accurately, systems, but in this case we could posit the same system housing the comparison GPUs) based on their graphics performance.
Indeed, I was thinking that the SM count might not be the only factor. So which formula (as a function of SMs, bandwidth, …) should I use as the criterion for the pro-rata task split?
It really depends on your kernel. First you have to consider whether the algorithm runs at all on each GPU (memory size, or special requirements such as tensor cores, which are not available before Volta, or small floating-point formats introduced in later architectures). Then consider whether your algorithm uses FP64 (datacenter cards preferred).
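A sketch of such a feasibility check; the requirements below (8 GB of memory, compute capability 7.0 for tensor cores) are made-up example values standing in for whatever your algorithm actually needs:

```cpp
// Sketch: filter out GPUs on which the algorithm cannot run at all,
// before worrying about how fast they are.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t required_mem = 8ull << 30;  // hypothetical: 8 GB working set
    const int    required_cc  = 70;          // hypothetical: tensor cores (Volta or newer)

    int n = 0;
    cudaGetDeviceCount(&n);

    std::vector<int> usable;
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        int cc = prop.major * 10 + prop.minor;
        bool ok = prop.totalGlobalMem >= required_mem && cc >= required_cc;
        printf("GPU %d (%s): %s\n", d, prop.name, ok ? "usable" : "excluded");
        if (ok) usable.push_back(d);
    }
    printf("%zu of %d GPUs can run the algorithm\n", usable.size(), n);
    return 0;
}
```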
After that, the number of SMs is not a bad criterion. Memory bandwidth and compute do not scale fully independently; NVIDIA tries to balance the two for each model. Newer generations get a newer, faster memory version, and higher-end cards in each generation get a wider memory interface (more bits).
You could create a short test kernel with which you can quickly benchmark each GPU in production. Even if it runs for just 10 ms, you get a better ratio estimate than with the SM count alone, and you can use the results dynamically to assign workloads.
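For example, a sketch of such an in-production micro-benchmark. `dummy_kernel` is a stand-in for a kernel that resembles your real workload, and the weights are simply the inverse of the measured times:

```cpp
// Sketch: time a short, representative kernel on every GPU and derive
// relative weights from the measured times.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void dummy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 200; ++k) x = x * 1.000001f + 0.5f;  // some FP32 work
        data[i] = x;
    }
}

int main() {
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);

    const int n = 1 << 22;
    std::vector<float> ms(n_dev);

    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        float* data;
        cudaMalloc(&data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dummy_kernel<<<(n + 255) / 256, 256>>>(data, n);   // warm-up launch
        cudaEventRecord(start);
        dummy_kernel<<<(n + 255) / 256, 256>>>(data, n);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms[d], start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(data);
    }

    double total = 0.0;
    for (int d = 0; d < n_dev; ++d) total += 1.0 / ms[d];
    for (int d = 0; d < n_dev; ++d)
        printf("GPU %d: %.3f ms -> weight %.2f\n", d, ms[d], (1.0 / ms[d]) / total);
    return 0;
}
```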
Or you create a table of all possible GPUs and their relative speeds for your algorithm.
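A sketch of such a table, keyed by the device name reported by `cudaGetDeviceProperties`; the entries and speed numbers are invented for illustration:

```cpp
// Sketch: a hand-maintained table of relative speeds for the GPUs you
// expect to encounter, with a fallback for unknown devices.
#include <cuda_runtime.h>
#include <cstdio>
#include <map>
#include <string>

int main() {
    // Hypothetical relative speeds for one particular algorithm (1.0 = baseline).
    std::map<std::string, double> relative_speed = {
        {"NVIDIA GeForce RTX 3060", 1.0},
        {"NVIDIA GeForce RTX 4090", 3.5},
        {"NVIDIA A100-SXM4-40GB",   2.8},
    };

    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        auto it = relative_speed.find(prop.name);
        double w = (it != relative_speed.end()) ? it->second : 1.0;  // fallback if unknown
        printf("GPU %d (%s): relative speed %.2f\n", d, prop.name, w);
    }
    return 0;
}
```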
As a variation on this scheme, one can start with very rough estimates of the execution time on the various GPUs and then fine-tune them automatically by tracking the actual execution times of all submitted tasks. With some PID-controller-like logic, such a system can even adapt to performance drift over time. Add some AI components if it needs to be really fancy.
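A much simpler stand-in for that idea, using an exponential moving average of measured throughput instead of a full PID controller; all the numbers are invented:

```cpp
// Sketch: start from rough per-GPU throughput estimates and refine them
// with an exponential moving average of measured task times.
#include <cstdio>
#include <vector>

struct GpuEstimate {
    double items_per_ms;   // current throughput estimate for one GPU
};

// Blend a new measurement into the estimate; alpha controls how fast we adapt.
void update_estimate(GpuEstimate& e, double items, double measured_ms, double alpha = 0.2) {
    double observed = items / measured_ms;
    e.items_per_ms = (1.0 - alpha) * e.items_per_ms + alpha * observed;
}

int main() {
    // Hypothetical initial guesses for two GPUs.
    std::vector<GpuEstimate> gpus = {{50.0}, {120.0}};

    // Pretend GPU 1 turned out slower than guessed on a few real tasks.
    update_estimate(gpus[1], 100000, 1100.0);
    update_estimate(gpus[1], 100000, 1150.0);

    double total = 0.0;
    for (auto& g : gpus) total += g.items_per_ms;
    for (size_t i = 0; i < gpus.size(); ++i)
        printf("GPU %zu: %.1f items/ms -> share %.2f\n",
               i, gpus[i].items_per_ms, gpus[i].items_per_ms / total);
    return 0;
}
```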
In the scientific computing world, do you know whether customers generally use the exact same GPUs? And what do you think if I use the following criterion for the ratio: #SM x CUDA cores x clock speed?
The CUDA core count is only relevant if your algorithm is limited by compute, specifically by FP32 compute without tensor cores.
#SM and #CUDA cores are typically proportional, so only one of them should be used.
A good metric could be sqrt(#SM x #CUDA cores) x clock speed.
Or use a factor on top of #SM, depending on the compute capability:
7.0-8.0: 64 FP32 + 64 INT32 → Factor 0.75
8.6-8.7: 64 FP32 + 64 FP32/INT32 → Factor 0.9
8.9-9.0: 128 FP32 + 64 INT32 → Factor 0.9
10.0-12.0: 64 FP32/INT32 + 64 FP32/INT32 → Factor 1.0
You could also consider the L2 cache size, which has changed across architectures.
The resulting metric would be: factor x #SM x clock speed.
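Putting that together, a sketch that computes factor x #SM x clock speed from the device properties. The factor mapping is copied from the list above and is only a starting point; `clockRate` is reported by the runtime in kHz:

```cpp
// Sketch: compute the suggested metric (factor x #SM x clock) per GPU,
// using the per-architecture factors listed above.
#include <cuda_runtime.h>
#include <cstdio>

double sm_factor(int major, int minor) {
    int cc = major * 10 + minor;
    if (cc >= 100) return 1.0;    // CC 10.0 - 12.0
    if (cc >= 89)  return 0.9;    // CC 8.9 - 9.0
    if (cc >= 86)  return 0.9;    // CC 8.6 - 8.7
    if (cc >= 70)  return 0.75;   // CC 7.0 - 8.0
    return 0.75;                  // older architectures: rough guess
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        double ghz    = prop.clockRate / 1.0e6;   // clockRate is given in kHz
        double metric = sm_factor(prop.major, prop.minor) * prop.multiProcessorCount * ghz;
        printf("GPU %d (%s): CC %d.%d, %d SMs, %.2f GHz -> metric %.1f\n",
               d, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, ghz, metric);
    }
    return 0;
}
```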
Thanks! You were right when you said that it would depend on my kernel's use of FP64/FP32…