Some of you might know of the HooPoe initiative from “GASS”.
The bottom line is that GPUs in the cloud are a possibility and may make sense in these recessionary times.
However, there are some hurdles for GPU compute which I would like to bring up here.
GPUs can't be virtualized the way CPUs can. Even if virtualized, the entire GPU has to be allocated to a single VM.
Job submissions have to be in the form of a CUBIN plus an EXE/a.out that loads it.
But that opens up a security issue. There is no guarantee that the EXE/a.out will actually load that CUBIN.
Say we have a workaround for 1. How do we ensure that the CUBIN is a good one? What if it deadlocks or runs indefinitely?
How can I reset the GPU and make it available for others?
Given that, it would make sense to have a program-settable timeout for kernels.
GPUs that drive a display have the 5-second watchdog to save us. However, other GPUs don't. It would be good to be able to specify the timeout as a
kernel launch parameter.
Maybe, if GPUs could run under VMs, the whole instance could be rebooted after an externally maintained timeout. But so far I have not heard of anyone
running GPUs under VMs. On physical machines, only a reboot is possible today, as I understand it.
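
To make the "externally maintained timeout" idea concrete, here is a minimal sketch of a supervisor that launches the submitted EXE/a.out as a child process and kills it once a caller-chosen timeout expires. The function name run_with_timeout and the stand-in "hung job" are hypothetical, made up for illustration. Note the big assumption: this only kills the host process. Whether that actually stops a runaway kernel and frees the GPU for others depends on the driver, and this sketch does not guarantee it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and kill it if it exceeds timeout_s seconds.

    Returns (finished, returncode). finished is False if the job had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        rc = proc.wait(timeout=timeout_s)
        return True, rc
    except subprocess.TimeoutExpired:
        proc.kill()  # forcibly terminate the host process of the job
        proc.wait()  # reap the child so it does not linger as a zombie
        return False, proc.returncode


if __name__ == "__main__":
    # Stand-in for a submitted EXE/a.out whose kernel never returns
    # (assumption: a real job would be the user's loader binary).
    hung_job = [sys.executable, "-c", "while True: pass"]
    finished, rc = run_with_timeout(hung_job, timeout_s=1.0)
    print("finished" if finished else "killed after timeout")
```

The timeout here is set by whoever submits or schedules the job, which is the same knob I am asking for as a kernel launch parameter, just enforced from outside the GPU.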