GPUs for clouds: a few considerations


Some of you may know of the HooPoe initiative from “GASS”.

The bottom line is that GPUs in the cloud are a real possibility and may make sense in these recession times.

However, there are some hurdles for GPU compute that I would like to bring up here.

  1. GPUs cannot be virtualized the way CPUs can. Even when they are, the entire GPU has to be allocated to a single VM.
    Job submissions have to be in the form of a CUBIN plus an EXE/a.out that loads it.
    But that opens up a security issue: there is no guarantee that the EXE/a.out will actually load that CUBIN.

  2. Say we have a workaround for 1. How do we ensure that the CUBIN is a good one? What if it deadlocks or runs indefinitely?
    How can I reset the GPU and make it available to others?
    In that sense, it would make sense to have a program-settable timeout for kernels.
    GPUs driving a display have that 5-second watchdog to save us; other GPUs don’t. It would be good to be able to specify the timeout as a
    kernel launch parameter.
    Maybe, if GPUs can run under VMs, the whole instance could be rebooted after an externally maintained timeout. But so far I have not heard of anyone
    running GPUs under VMs. On physical machines, only a reboot is possible today, as I understand it.

Any thoughts?

Best Regards,

Hmm, I think both of these problems would be abstracted away just as they are in the CPU case… by a virtual machine host.
The virtual machine must have CUDA support, however, which is possible but not common yet… Parallels and VMware both have “pro” versions that support CUDA.

Once you have that, you can safely let the app do whatever it wants. You don’t need to trust it or worry about deadlocks or proper cubins; that’s the app’s problem.
The virtual machine can be shut down, and that releases the GPU hardware even if it’s locked up. [Well, that’s the theory! If it doesn’t, it’s a bug…]

Thanks for the link…

VT-d-enabled CPUs and motherboards can let a guest OS see real hardware… Xen and other VM providers have been tinkering with it (VMware especially, for network speed).

But say they did the same for the GPU card, and there is this misbehaving kernel… I stop the VM.

So the hypervisor needs a way to reset the card and bring it back to normal. This, I think, is not possible without physically rebooting the machine. If it were there in the driver, we could use it right now… on our PCs, right?

If NVIDIA could provide this feature in their driver, then we could easily stop misbehaving kernels from hijacking the GPU card…

I might be wrong here… or perhaps unaware of a feature that is already there. Any thoughts?

I think that ability is there for VMs… since NVIDIA cooperated with the VM vendors to make it work. It’s likely why you have to assign an entire GPU to a VM… no sharing or time slicing, so a hung GPU doesn’t affect the host at all; it’s entirely assigned to the VM software.

Of course, I have never USED these VMs with CUDA, but this is what NVIDIA described anyway. It would definitely get messy if the host shared the GPU with a VM.

Very interesting, if that were the case.

If NVIDIA could release that feature in the driver via some API calls and support applications, it would be useful for killing misbehaving kernels from the command prompt.

My understanding is that if a kernel deadlocks on a non-display GPU, it is then locked forever. Is that right?