Problems with a virtualized K80 GPGPU on Azure

Dear All,

Microsoft Azure
Ubuntu 20.04.
Virtualized K80 GPGPU
EGL Drivers
NVIDIA driver from CUDA 11.4.4

I am having problems with the virtualized K80 on Microsoft Azure. In CUDA, the program sometimes fails to synchronize after a kernel, or a GPU memory allocation fails even though plenty of memory is available.

It works after a fresh installation of the driver (from CUDA 11.4.4) and a reboot.

Any suggestions?

Thanks,

Luís Gonçalves

Please enable nvidia-persistenced to start on boot, make sure it is running continuously, and check whether that resolves the issue.
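
One way to confirm that persistence mode is actually on before starting the CUDA program is to ask nvidia-smi for it. The following is only a minimal sketch in Python, assuming nvidia-smi is on the PATH and supports the persistence_mode query field; the function names are made up for illustration.

```python
import subprocess


def parse_persistence_modes(csv_text):
    """Turn nvidia-smi CSV output (one line per GPU) into a list of booleans.

    Each non-empty line is expected to read "Enabled" or "Disabled".
    """
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    return [line.lower() == "enabled" for line in lines]


def query_persistence_modes():
    """Ask nvidia-smi for the persistence mode of every GPU it can see.

    Assumption: nvidia-smi is installed and on the PATH. Raises
    subprocess.CalledProcessError if nvidia-smi exits with an error.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_persistence_modes(result.stdout)


def all_persistent(modes):
    """True only if at least one GPU was reported and all report Enabled."""
    return bool(modes) and all(modes)


# Example use on the GPU machine (not executed here):
#   if not all_persistent(query_persistence_modes()):
#       print("Persistence mode is off on at least one GPU")
```

If the check reports that persistence mode is off after a reboot, the boot-time setup is probably not taking effect.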

When I run the program in the foreground, I do not notice any problems. When I run it in the background, it has problems. They appear in the middle of the run: earlier CUDA operations complete fine, but later ones start to fail.

All of this is with persistence mode on.

The program is launched from Python with os.system. I changed the command to “sudo nvidia-smi -pm ENABLED && ./program &” and it worked. The user does not need a password for sudo.

I cannot see the output of that command.
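
If the missing output is a problem, one option is to launch the commands with the subprocess module instead of os.system, so that the output of nvidia-smi can be read and the background program writes to a log file. This is only a sketch under assumptions: the helper names and the log file name are hypothetical, and ./program stands for the CUDA executable mentioned above.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd (a list of arguments), wait for it, and return its results.

    Returns a tuple (exit_code, stdout_text, stderr_text).
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def start_in_background(cmd, log_path):
    """Start cmd without waiting for it to finish.

    Both stdout and stderr of the child are written to log_path, so the
    messages can be inspected later. Returns the Popen object.
    """
    log_file = open(log_path, "w")
    process = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child keeps its own copy of the file descriptor
    return process


# Example use on the GPU machine (not executed here; "program.log" is a
# hypothetical file name):
#   code, out, err = run_and_capture(["sudo", "nvidia-smi", "-pm", "ENABLED"])
#   print(code, out, err)
#   if code == 0:
#       proc = start_in_background(["./program"], "program.log")
```

Reading the exit code and the log makes it easier to tell whether a later failure comes from nvidia-smi or from the program itself.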

I do not know if it is a coincidence.

The procedure in my last post does not work every time. The solution of enabling persistence mode at boot has now worked 3 times in a row.

Ubuntu 20.04

/etc/rc.local


#!/bin/bash
sudo nvidia-smi -pm ENABLED


Do not forget to give rc.local execute permission.


Thanks for the solution.