I think I managed to run my job using 2 GPUs. I benchmarked two jobs: the first running on 1 GPU and the second on 2 GPUs. Both ran for 10,000 steps.
Both took about the same time, around 1.5 hours.
The loss for the 2nd case is lower, but not by much:
[step: 10000] loss: 2.716e-02, time/iteration: 2.154e+02 ms
[step: 10000] loss: 3.458e-02, time/iteration: 2.006e+02 ms
In the 2nd job, there's a message that seems to imply I'm using 2 GPUs:
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
So am I using 2 GPUs in the 2nd job? Why did both jobs take the same time to complete?
It is mentioned that:
This data parallel fashion of multi-GPU training keeps the number of points sampled per GPU constant while increasing the total effective batch size. You can use this to your advantage to increase the number of points sampled by increasing the number of GPUs allowing you to handle much larger problems.
Is this what is happening now? In a multi-GPU run with 2 GPUs, am I actually using twice the batch size in total? If so, that would explain why the time taken is the same.
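To check my understanding of the quoted passage, here is a minimal sketch in plain Python (not library code). The per-GPU batch size of 1000 and the 200 ms per iteration are made-up round numbers, the latter chosen only to be close to the logged values, and the function names are my own.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total points sampled per step when each GPU keeps the same batch size."""
    return per_gpu_batch * num_gpus


def points_per_second(per_gpu_batch: int, num_gpus: int, ms_per_iteration: float) -> float:
    """Training throughput: total points processed per second of wall time."""
    seconds_per_iteration = ms_per_iteration / 1000.0
    return effective_batch_size(per_gpu_batch, num_gpus) / seconds_per_iteration


# Assumed values for illustration only; the real batch size is not in my post.
PER_GPU_BATCH = 1000
MS_PER_ITERATION = 200.0  # round number near the logged time/iteration

single_gpu = points_per_second(PER_GPU_BATCH, 1, MS_PER_ITERATION)
dual_gpu = points_per_second(PER_GPU_BATCH, 2, MS_PER_ITERATION)

print("effective batch, 1 GPU:", effective_batch_size(PER_GPU_BATCH, 1))
print("effective batch, 2 GPUs:", effective_batch_size(PER_GPU_BATCH, 2))
print("throughput ratio (2 GPUs / 1 GPU):", dual_gpu / single_gpu)
```

If this reading is right, the time per step stays roughly the same while the number of points processed per step doubles, so the wall time for 10,000 steps would not be expected to drop.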
Please clarify. Thank you.