System uses only 1 out of 4 GPUs at a time on Azure NC instance

Hi,
My system description is as follows:
4 x Tesla K80
5 x CUDA worker processes per GPU

All processes are properly distributed between GPUs
All processes receive jobs uniformly (round-robin)
All GPUs running in persistence mode (I don’t think it is related)

Yet, my system still only uses one GPU at a time

From analyzing the nvidia-smi output I see that every device gets PCI device ID 0x00
dmesg prints errors that point to the same issue

Here is the link to nvidia-bug-report.log
https://api.cacher.io/raw/0ae82e2cd05ff9fa5a86/097448714c820dffe0f5/nvidia-bug-report.log

Thanks,
Doron

How are you accomplishing that, specifically?

I assume your application is designed to use 1 GPU. Unless your application has a specific method (e.g. a command line switch) to direct it to use other than the default GPU (GPU 0), and/or you are using an explicit method to steer the application to use other than the default GPU (e.g. CUDA_VISIBLE_DEVICES) then it is expected behavior that all processes would attempt to use GPU 0.
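For illustration, here is a minimal sketch (not your application’s code) that prints which physical GPU a process actually ends up on. Run with no steering at all, every process will report the same device:

// Sketch only: report the device this process is using and its PCI bus ID.
// Without cudaSetDevice or CUDA_VISIBLE_DEVICES, every process reports the
// default device (GPU 0).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaGetDevice(&dev);                              // current device for this process
    char busId[32] = {0};
    cudaDeviceGetPCIBusId(busId, sizeof(busId), dev); // matches the address shown by nvidia-smi
    printf("Using CUDA device %d @ PCI %s\n", dev, busId);
    return 0;
}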

Maybe I didn’t explain myself well enough

I did it by using cudaSetDevice.
Basically I have code that makes sure every GPU gets exactly 5 processes.

By saying “All processes are properly distributed between GPUs” I meant that I’ve verified that code
using nvidia-smi, and I see the processes allocated correctly to each of the GPUs,
5 per GPU.
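For reference, a minimal sketch of that kind of distribution (this is not my exact code; it assumes each worker process is started with its worker index as a command-line argument):

// Hypothetical sketch: bind each worker process to one GPU with cudaSetDevice.
// With 20 workers and 4 GPUs, each GPU ends up with exactly 5 workers.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    int workerIndex = (argc > 1) ? atoi(argv[1]) : 0;  // assumed: index passed as argv[1]

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);                  // 4 on this instance
    int device = workerIndex % deviceCount;            // round-robin assignment

    cudaError_t err = cudaSetDevice(device);           // bind this process to one GPU
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed: %s\n", device, cudaGetErrorString(err));
        return 1;
    }
    printf("Worker %d bound to GPU %d\n", workerIndex, device);
    // ... the worker would create its context and launch its kernels here ...
    return 0;
}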

This exact configuration is already running well (all GPUs are 100% utilized) on a similar AWS instance.
The difference between the two (AWS vs. Azure) is that on AWS every GPU gets a valid PCI device ID.

I’ve looked at your bug report log, and I see 10 processes per GPU.

How are you making that determination? It appears that processes are distributed to each GPU.

I don’t see errors in the dmesg output in the bug report log file. I may have missed it. Could you point out what you are referring to?

Furthermore, when considering the full PCI address of each device:

https://wiki.xenproject.org/wiki/Bus:Device.Function_(BDF)_Notation

they appear to be unique:

Attached GPUs : 4
GPU 00009EB9:00:00.0
GPU 0000B5B3:00:00.0
GPU 0000CBFB:00:00.0
GPU 0000E81C:00:00.0

Yes, they all have the same PCI bus and device ID. I don’t see any indication that this is a problem: in this case, the GPUs are all on separate PCI domains, so it appears OK to me. This could certainly vary from system to system; it depends on the PCI bus topology and enumeration order.

Finally, I note this in the log file:

[ 180.235947] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.73 Mon Aug 21 14:47:41 PDT 2017 (using threaded interrupts)

elsewhere it appears you are using 387.26. I don’t think it’s an issue, but for a clean system I wouldn’t expect to see references to another driver. I would have to parse exactly what the bug reporter was looking at there.

That’s correct.

By using nvidia-smi to monitor the utilization of each GPU,
I see one GPU in use at a time; not necessarily the same one, but only one being utilized.

As part of troubleshooting this problem I tried the official driver version for the Tesla K80, thinking the driver might be the cause.
But it didn’t solve the problem, and currently I’m back on the driver that comes with CUDA 9.1 (387.26).

These lines:

[   30.846500] nvidia 9eb9:00:00.0: can't derive routing for PCI INT A
[   30.846502] nvidia 9eb9:00:00.0: PCI INT A: no GSI
[   30.859498] nvidia b5b3:00:00.0: can't derive routing for PCI INT A
[   30.859499] nvidia b5b3:00:00.0: PCI INT A: no GSI
[   30.869178] nvidia cbfb:00:00.0: can't derive routing for PCI INT A
[   30.869178] nvidia cbfb:00:00.0: PCI INT A: no GSI
[   30.881454] nvidia e81c:00:00.0: can't derive routing for PCI INT A
[   30.881455] nvidia e81c:00:00.0: PCI INT A: no GSI

And these:

[  367.642468] nvidia-modeset: Allocated GPU:0 (GPU-9e750155-112c-7c87-e35b-0c9f97d7b45e) @ PCI:9eb9:00:00.0
[  367.642719] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  367.642846] nvidia-modeset: Freed GPU:0 (GPU-9e750155-112c-7c87-e35b-0c9f97d7b45e) @ PCI:9eb9:00:00.0
[  367.642996] nvidia-modeset: Allocated GPU:0 (GPU-9e750155-112c-7c87-e35b-0c9f97d7b45e) @ PCI:9eb9:00:00.0
[  367.643085] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  367.643119] nvidia-modeset: Freed GPU:0 (GPU-9e750155-112c-7c87-e35b-0c9f97d7b45e) @ PCI:9eb9:00:00.0
[  391.678945] nvidia-modeset: Allocated GPU:0 (GPU-eb532b16-1499-0bdb-f075-58c949bbd9ba) @ PCI:b5b3:00:00.0
[  391.679101] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  391.679165] nvidia-modeset: Freed GPU:0 (GPU-eb532b16-1499-0bdb-f075-58c949bbd9ba) @ PCI:b5b3:00:00.0
[  391.679301] nvidia-modeset: Allocated GPU:0 (GPU-eb532b16-1499-0bdb-f075-58c949bbd9ba) @ PCI:b5b3:00:00.0
[  391.679434] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  391.679485] nvidia-modeset: Freed GPU:0 (GPU-eb532b16-1499-0bdb-f075-58c949bbd9ba) @ PCI:b5b3:00:00.0
[  415.326827] nvidia-modeset: Allocated GPU:0 (GPU-8c85039d-caaf-9efe-763b-302f00314522) @ PCI:cbfb:00:00.0
[  415.327368] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  415.327456] nvidia-modeset: Freed GPU:0 (GPU-8c85039d-caaf-9efe-763b-302f00314522) @ PCI:cbfb:00:00.0
[  415.327907] nvidia-modeset: Allocated GPU:0 (GPU-8c85039d-caaf-9efe-763b-302f00314522) @ PCI:cbfb:00:00.0
[  415.328399] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  415.328471] nvidia-modeset: Freed GPU:0 (GPU-8c85039d-caaf-9efe-763b-302f00314522) @ PCI:cbfb:00:00.0
[  463.540303] nvidia-modeset: Allocated GPU:0 (GPU-d20aa673-b44e-62ef-abee-803ed9d4f79f) @ PCI:e81c:00:00.0
[  463.540870] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  463.541144] nvidia-modeset: Freed GPU:0 (GPU-d20aa673-b44e-62ef-abee-803ed9d4f79f) @ PCI:e81c:00:00.0
[  463.542874] nvidia-modeset: Allocated GPU:0 (GPU-d20aa673-b44e-62ef-abee-803ed9d4f79f) @ PCI:e81c:00:00.0
[  463.543107] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  463.543203] nvidia-modeset: Freed GPU:0 (GPU-d20aa673-b44e-62ef-abee-803ed9d4f79f) @ PCI:e81c:00:00.0

But if you say it’s OK for the four GPUs to have the same device ID, then it may be something else.

I don’t think any of the things you are pointing out are actual problems.

If you are observing that all GPUs are being utilized at different points, then I think the problem is elsewhere. If the GPUs are working at all, the interrupts are being handled correctly. In my experience, the GPU driver generally chooses to use PCIe message-signaled interrupts (MSI), so routing of the PCI INTA signal shouldn’t matter. Here’s an excerpt from your bug report log:

*** ls: -r--r--r-- 1 root root 0 2018-01-01 12:18:31.481520765 +0000 /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      CPU16      CPU17      CPU18      CPU19      CPU20      CPU21      CPU22      CPU23      
...
  24:        198          0          0          0          0          0          0         38          0          0          0    2199029          0          0          0          0          0          0          0          0          0          0          0          0  Hyper-V PCIe MSI -939524096-edge      nvidia
  25:          1          0          0          0          0          0          0          0          0   17781407          0        213          0          0          0          0          0          0          0          0          0          0          0          0  Hyper-V PCIe MSI -1744830464-edge      nvidia
  26:          1          0          0          0          0    2193609          0        206          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  Hyper-V PCIe MSI -671088640-edge      nvidia
  27:          1          0          0          0          0          0          0    2195901          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0  Hyper-V PCIe MSI -536870912-edge      nvidia

I would devise a simpler test to determine whether or not two processes can use 2 separate GPUs at the same time (it should be possible). This might be as simple as launching 2 instances of a CUDA sample, one with CUDA_VISIBLE_DEVICES=0 and one with CUDA_VISIBLE_DEVICES=1.
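For example, a minimal sketch (not an official CUDA sample) that keeps one GPU busy for roughly 10 seconds, so utilization is easy to see in nvidia-smi:

// Sketch only: a kernel that busy-waits on the GPU for a given number of clocks.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }     // busy-wait on the GPU
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // device 0 as seen by this process
    // prop.clockRate is in kHz; spin for roughly 10 seconds of GPU clocks
    long long cycles = (long long)prop.clockRate * 1000LL * 10LL;
    spin<<<1, 1>>>(cycles);
    cudaError_t err = cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(err));
    return 0;
}

Build it with nvcc, then launch one copy with CUDA_VISIBLE_DEVICES=0 and another with CUDA_VISIBLE_DEVICES=1 while watching nvidia-smi. If both GPUs show utilization at the same time, the driver and hardware side are working.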

If that works, then the problem may be due to some characteristic of your code.

Are you using CUDA MPS at all in this setup?

If not, processes sharing the same GPU will definitely serialize.

Thank you very much for your help!
I’ll be sure to do so.
I’ll also check whether I’m currently using MPS.