We are testing the performance of Modulus on our cluster with multiple GPUs. To that end, we ran the basic wave_equation tutorial with 10000 batch points on one GPU and on two GPUs. Surprisingly, the single-GPU run took 45 min while the two-GPU run took 50 min. The GPUs are V100s. Please see the attachment with the GPU activity for the two GPUs. Modulus is installed with Docker. I don't think this is supposed to happen. Can anyone give me any suggestions, please? Thanks.
In Modulus, the batch size defined in the config is by default the local (per-GPU) batch size. This means that if you keep the batch size the same between 1 and 2 GPUs, you have gone from a global batch size of 10000 to 20000.
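To make the arithmetic concrete, here is a minimal sketch (plain Python for illustration, not Modulus code; the numbers match your run):

```python
# Modulus treats the configured batch size as the local (per-GPU) batch size,
# so the effective global batch size grows with the number of GPUs.
local_batch_size = 10000  # value from the config, unchanged between runs

for num_gpus in (1, 2):
    global_batch_size = local_batch_size * num_gpus
    print(f"{num_gpus} GPU(s): local={local_batch_size}, global={global_batch_size}")

# 1 GPU(s): local=10000, global=10000
# 2 GPU(s): local=10000, global=20000
```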
This is therefore a weak scaling test, and the best you could hope for is the same wall-clock time going from 1 to 2 GPUs. Even that won't happen in practice, because the communication between the GPUs adds some overhead. The exact overhead depends on the hardware you have and how it is configured, the size of the model, etc.
For a strong scaling test, reduce your batch size to 5000 when running on 2 GPUs so the global batch size stays at 10000. Keep in mind, though, that if your GPUs aren't fully saturated at the smaller per-GPU batch size, each GPU is partially idle and you're not going to see ideal scaling.
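As a sketch of the strong-scaling setup (assuming a standard launcher that sets the `WORLD_SIZE` environment variable, e.g. torchrun or an MPI wrapper; the variable names here are illustrative, not Modulus API):

```python
# Strong scaling: keep the global batch size fixed and divide it across GPUs.
import os

target_global_batch = 10000
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by the launcher
local_batch = target_global_batch // world_size

print(f"{world_size} GPU(s) -> per-GPU batch size {local_batch}")
# 1 GPU  -> 10000 per GPU (global 10000)
# 2 GPUs -> 5000 per GPU  (global 10000)
```

In practice you would just set the batch size value(s) in the example's config to 5000 before launching the two-GPU run.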
I was initially confused by the same sort of tests until I understood how the scaling works, as @ngeneva explained it.
Another thing I noticed is that we weren't using anywhere close to the total memory available on a single GPU yet.
So increasing the batch size(s) and decreasing the number of iterations was the first step toward training to the same accuracy in less time.
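A quick way to check the memory headroom mentioned above is to watch `nvidia-smi` during training, or to query PyTorch from inside the process; a minimal sketch (not Modulus-specific):

```python
# Rough check of GPU memory headroom from inside a running PyTorch process.
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda")
    allocated = torch.cuda.memory_allocated(dev) / 1024**3  # tensors currently held
    reserved = torch.cuda.memory_reserved(dev) / 1024**3    # held by the caching allocator
    total = torch.cuda.get_device_properties(dev).total_memory / 1024**3
    print(f"allocated {allocated:.1f} GiB, reserved {reserved:.1f} GiB, total {total:.1f} GiB")
```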
After that, increasing the number of GPUs improved accuracy for the same configuration, so we could lower the number of iterations again for the same accuracy, but I haven't found the sweet spot yet.