T4 on Ubuntu 18.04

I have been working with a T4 card in my Lenovo P330 development box for a while without any issues, but it fails to start now that I have moved it into the production system, an industrial PC based on a DFI CS630-Q370.

Both systems are running vanilla Ubuntu 18.04.x server with the latest patches. The CUDA packages were installed via the “deb (network)” procedure.

It appears to be a PCIe memory mapping issue which then causes the nvidia-persistenced.service to fail to start:

$ systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: failed (Result: start-limit-hit) since Tue 2019-06-04 14:08:47 UTC; 8s ago
  Process: 9586 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/* (code=exited, status=0/SUCCESS)
  Process: 9581 ExecStart=/usr/bin/nvidia-persistenced --verbose (code=exited, status=0/SUCCESS)

Jun 04 14:08:54 db-raillab systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jun 04 14:08:54 db-raillab systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jun 04 14:08:54 db-raillab systemd[1]: nvidia-persistenced.service: Failed with result 'start-limit-hit'.
Jun 04 14:08:54 db-raillab systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jun 04 14:08:55 db-raillab systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jun 04 14:08:55 db-raillab systemd[1]: nvidia-persistenced.service: Failed with result 'start-limit-hit'.
Jun 04 14:08:55 db-raillab systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jun 04 14:08:55 db-raillab systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Jun 04 14:08:55 db-raillab systemd[1]: nvidia-persistenced.service: Failed with result 'start-limit-hit'.
Jun 04 14:08:55 db-raillab systemd[1]: Failed to start NVIDIA Persistence Daemon.

In dmesg I am seeing a continuous stream of:

[  556.310812] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[  556.310988] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[  556.310988] NVRM: The system BIOS may have misconfigured your GPU.
[  556.310990] nvidia: probe of 0000:01:00.0 failed with error -1
[  556.310996] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  556.310997] NVRM: None of the NVIDIA graphics adapters were initialized!
[  556.311063] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
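For anyone wanting to pick these lines out of a longer log, here is a minimal Python sketch. It assumes every NVRM BAR line follows the same "BARn is <size>M @ <address> (PCI:<device>)" layout as the excerpt above; the function name and the sample string are my own illustration, not part of any NVIDIA tool.

```python
import re

# Pattern modelled on the single NVRM line seen in the dmesg excerpt, e.g.
#   NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
# Assumption: other BAR reports use the same layout and the "M" size suffix.
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<dev>[0-9a-fA-F:.]+)\)"
)

# Sample text copied from the log above (hypothetical input for the sketch).
SAMPLE_DMESG = """\
[  556.310988] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[  556.310988] NVRM: The system BIOS may have misconfigured your GPU.
"""


def find_invalid_bars(dmesg_text):
    """Return sorted, de-duplicated (pci_device, bar_index) pairs whose
    reported size or address is zero."""
    found = set()
    for match in BAR_LINE.finditer(dmesg_text):
        size_mb = int(match.group("size"))
        address = int(match.group("addr"), 16)
        if size_mb == 0 or address == 0:
            found.add((match.group("dev"), int(match.group("bar"))))
    return sorted(found)
```

Calling find_invalid_bars() on the saved output of dmesg collapses the repeating stream into one entry per affected device and BAR.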

In lspci I get:

01:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at 93000000 (32-bit, non-prefetchable) 
        Memory at <ignored> (64-bit, prefetchable)
        Memory at 94000000 (64-bit, prefetchable) 
        Capabilities: <access denied>
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
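The "Memory at <ignored>" line is the one that stands out. As a rough way to check for it across machines, here is a Python sketch that scans the text of lspci -v for a single device and lists memory regions with no usable address. The function name is hypothetical, and the index it returns is only the position of the line in the listing, not the BAR number (a 64-bit BAR occupies two BAR slots).

```python
import re

# Matches lines such as "Memory at 93000000 (32-bit, non-prefetchable)".
REGION = re.compile(r"^\s*Memory at (?P<addr>\S+) \((?P<attrs>[^)]*)\)", re.M)

# Placeholders lspci prints instead of an address when a region is not mapped.
UNMAPPED = ("<ignored>", "<unassigned>")

# Sample text copied from the lspci output above (input for the sketch).
SAMPLE_LSPCI = """\
01:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at 93000000 (32-bit, non-prefetchable)
        Memory at <ignored> (64-bit, prefetchable)
        Memory at 94000000 (64-bit, prefetchable)
"""


def unmapped_regions(lspci_text):
    """Return (position, attributes) for each memory region of one device
    that lspci shows without an address. Position counts "Memory at" lines
    from zero; it is not the PCI BAR index."""
    result = []
    for position, match in enumerate(REGION.finditer(lspci_text)):
        if match.group("addr") in UNMAPPED:
            result.append((position, match.group("attrs")))
    return result
```

On the output above this flags the second memory region, the 64-bit prefetchable one, which would be consistent with the BAR1 complaint in dmesg.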

Any suggestions?

Regards,

Geoff

T4 is only designed to be used in a system that has been properly qualified by the OEM for its use. You can’t just plug it into any PC and expect proper operation.

That’s obviously very disappointing to hear after we have already paid for two of them! The only compatibility specs I was aware of were PCIe gen 3 and 70W power/cooling, and we are good on all of those as far as I am aware. We were looking for the maximum number of CUDA cores in a 2U chassis.

No mention of any such restrictions from your “Elite Solution Provider” where we bought them:

https://www.scan.co.uk/business/nvidia-tesla-t4

The PC used is standard for the industry and mandated by the customer so no flexibility there.

Looking at the errors it seems like it might be a BIOS limitation?

Any suggestions on how we might move forward from here?

Geoff

NVIDIA doesn’t support the use of Tesla GPUs in platforms that were not designed and properly qualified for their use.

https://www.nvidia.com/en-us/data-center/tesla/tesla-qualified-servers-catalog/

Given that statement, the options I can imagine would be to work with DFI to request that their system be certified by NVIDIA for use of T4, or to choose another system that is already certified for T4.

But NVIDIA doesn’t provide support for inquiries such as what you are asking here.

Yes, it appears to be a BIOS issue, but that’s hard to confirm on a forum thread. There may be BIOS settings that affect this. And even if you work past that issue on your own:

  1. There might be other issues on the platform, perhaps less obvious, such as cooling. Tesla GPUs generally require flow-through cooling provided by the system. The card has no fan and does not keep itself cool. This also requires integration into a server BMC to monitor the GPU temperature and adjust fan speed accordingly. An “off the shelf” PC is not designed to do this. And there may be other issues.

  2. Even if all the above were addressed, the system is still not certified for use with T4 (AFAIK). It would still not be supported by NVIDIA for use of T4 in any way.

If DFI were to claim that the system is certified for use of T4, then you should address your support questions to the vendor that sold you the system or cards first. That is the Tesla support model - support starts with the place that you purchased the certified system.

Unfortunately that information isn’t presented anywhere in your “Elite Solution Provider”'s sales process.

Looks like we have spent £5000 on two paperweights!

Feeling thoroughly let down by NVIDIA here.

What is the best card in your range for use in any vanilla 2U PCIe gen 3 system to give the maximum number of CUDA cores?

Thanks,

Geoff

There is no GPU sold by NVIDIA that is guaranteed to work in any system you plug it into.

Having said that, you may wish to investigate the Quadro line of GPUs. They have their own integrated fan, so they will attempt to keep themselves cool, for example. There may still be compatibility issues, of course. If you want maximum CUDA cores, then look at the larger GPUs in that line. Power requirements, physical size, system/BIOS compatibility, and cooling airflow all still need to be considered for a proper deployment.

Whether or not they work in your PC, or meet your application requirements, I cannot say. My recommendation would be to work with your solution provider to get this resolved.