One of the 4 GPUs (GeForce RTX 2080 Ti) does not show up on nvidia-smi

A new server just arrived and I proceeded to install Ubuntu 16.04 and CUDA+cuDNN as usual. After installing everything, one of the GPUs is missing from nvidia-smi, even though all 4 show up in lspci:

19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev ff)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
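
Note that the 67:00.0 card is the only one reporting “(rev ff)” instead of “(rev a1)”; an all-ones revision ID usually means the device is no longer answering PCI config reads. As a generic check (nothing specific to this box), its config space can be dumped; an unresponsive card reads back nothing but ff bytes:

# Compare revision IDs across the four cards; "rev ff" marks the dead one
lspci | grep -i nvidia
# Dump the raw PCI config space of the suspect device;
# all-0xff bytes confirm it is not responding on the bus
sudo lspci -s 67:00.0 -xxx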

When I run nvidia-smi, these messages appear in dmesg:

[    8.672022] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[    8.672138] caller os_map_kernel_space.part.7+0xd8/0x120 [nvidia] mapping multiple BARs
[   11.860957] NVRM: RmInitAdapter failed! (0x26:0xffff:1125)
[   11.860979] NVRM: rm_init_adapter failed for device bearing minor number 2
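
The “minor number 2” in the NVRM message corresponds to the 67:00.0 device above (minors are assigned in bus order here, 0 and 1 being the first two cards). If it helps, the mapping can be read back from /proc for the cards the driver did manage to initialize; a generic check, assuming a reasonably recent driver that exposes these entries:

# Each initialized GPU gets a /proc directory keyed by its PCI bus ID;
# the "information" file includes the assigned device minor number
grep -H 'Device Minor' /proc/driver/nvidia/gpus/*/information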

If I let the server run for some time, nvidia-smi then throws:

Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error
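
To catch the exact moment this happens, the kernel log can be followed live and filtered for driver errors (a generic watch, nothing specific to this driver version):

# Follow the kernel ring buffer and surface only NVRM/Xid lines
dmesg -w | grep -Ei 'nvrm|xid'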

I attached two bug reports: one from before nvidia-smi “breaks” and another from after. Any help is welcome.
nvidia-bug-report-after.log.gz (2.06 MB)
nvidia-bug-report-before.log.gz (1.87 MB)

Some kind of hardware failure; maybe the card is just improperly seated, or a power connector is missing or not fully connected. If reseating and checking the power connectors doesn’t help, test the card in another system to rule out general hardware failure.
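
After reseating, a quick sanity check is to list the GPUs the driver can actually talk to; standard nvidia-smi queries, nothing exotic:

# All four cards should enumerate with a UUID
nvidia-smi -L
# Per-card power state and draw; odd readings on a card that
# otherwise initializes can hint at a loose PCIe power cable
nvidia-smi -q -d POWER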

Indeed, after many attempts I was able to confirm that one of the GPUs was faulty, but only after loading the NVIDIA driver. On Windows, it would fall back to the generic display driver, so the faulty GPU could still be used when a display was connected to it; otherwise it was the same thing: it reported that something was wrong with one of the GPUs. Let’s see if we manage to RMA it.