Problems with Azure with K80 GPGPU

luisgoje9hl · March 24, 2022, 4:52pm

Dear All,

Microsoft Azure
Ubuntu 20.04.
Virtualized K80 GPGPU
EGL Drivers
NVIDIA driver from CUDA 11.4.4

I am having problems with the virtualized K80 in Microsoft Azure. In CUDA, it sometimes fails to synchronize after a kernel, or a GPU memory allocation fails even though plenty of memory is available.

It works after a fresh driver installation (from CUDA 11.4.4) and a reboot.

Any suggestions?

Thanks,

Luís Gonçalves

generix · March 24, 2022, 4:59pm

Please enable nvidia-persistenced to start on boot, make sure it is continuously running, and check whether that resolves the issue.
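One way to confirm that persistence mode is actually on before starting a job is to ask nvidia-smi from Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes nvidia-smi is on the PATH and prints `Enabled` or `Disabled` per GPU for the `persistence_mode` query field.

```python
import subprocess


def parse_persistence_modes(csv_text):
    """Return one bool per GPU from nvidia-smi CSV output (one value per line)."""
    return [
        line.strip().lower() == "enabled"
        for line in csv_text.splitlines()
        if line.strip()  # skip blank lines
    ]


def persistence_enabled_on_all_gpus():
    """Query nvidia-smi; True only if every GPU reports persistence mode on.

    Assumes nvidia-smi is installed and on the PATH (hypothetical helper).
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=persistence_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi itself fails
    )
    modes = parse_persistence_modes(result.stdout)
    return bool(modes) and all(modes)
```

Calling `persistence_enabled_on_all_gpus()` right before launching the CUDA program gives a quick check that the daemon or the boot script really took effect.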

luisgoje9hl · March 24, 2022, 6:30pm

When I run the program in the foreground I do not notice any problems, but when I run it in the background it has problems. They appear in the middle of the run: earlier CUDA operations run fine, but then they start failing.

All with persistence mode on.

luisgoje9hl · March 24, 2022, 6:58pm

The program is run with os.system in Python. I changed the command to “sudo nvidia-smi -pm ENABLED && ./program &” and it worked. The user does not need a password for sudo.

I cannot see the output of that command.

I do not know if it is a coincidence.
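For the missing output, one option is to start the program with subprocess instead of os.system and send its stdout and stderr to a log file. This is a rough sketch, not a tested fix for the K80 issue: the helper name `launch_in_background` and the log file name are only illustrative.

```python
import subprocess


def launch_in_background(cmd, log_path):
    """Start cmd without waiting for it; append its stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # merge error messages into the same log
            stdin=subprocess.DEVNULL,  # a background job should not read the terminal
        )
    finally:
        # The child keeps its own copy of the file descriptor,
        # so the parent can close its handle right away.
        log_file.close()
    return proc
```

Something like `launch_in_background(["./program"], "program.log")` returns immediately, and any CUDA error messages the program prints end up in `program.log`, where they can be read later.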

luisgoje9hl · March 25, 2022, 11:38am

The procedure in my last post does not work every time. The solution of enabling persistence mode at boot has now worked 3 times in a row.

Ubuntu 20.04

/etc/rc.local

#!/bin/bash
sudo nvidia-smi -pm ENABLED

Do not forget to set execute permission on /etc/rc.local.

Thanks for the solution.