We have some systems with multiple GTX 1080 Ti cards. Sometimes one of the cards hangs, and it is then not possible to use CUBLAS as a regular user.
Driver Version: 384.90
CUDA: 9.0.176

When I run simpleCUBLAS as a regular user, I get the message CUBLAS_STATUS_NOT_INITIALIZED.
When I run the same test as root, the problem does not occur.

Is there any way I can debug this?

Kind regards
University Düsseldorf

Is it always the same card that “hangs”, or is it a random one from among the several in the system?

How rigorous is the error checking performed by the application? The application should check the status of every CUDA API call, and of every call to a CUDA-accelerated API. Try running the application under control of cuda-memcheck to see whether any earlier errors are reported.

One scenario for a status of CUBLAS_STATUS_NOT_INITIALIZED would be that CUBLAS could not initialize because there wasn’t a valid CUDA context to start with. That in turn might have multiple causes, such as the current setting of CUDA_VISIBLE_DEVICES, or a GPU having temporarily “fallen off the bus” (check the relevant system logs to see whether any issues with GPUs are reported).
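To make the log check above less tedious, here is a minimal sketch in Python that filters kernel log text for lines that look like NVIDIA driver messages. The function name find_gpu_messages is made up for illustration, and the patterns are an assumption about how the driver usually tags its messages ("NVRM:" prefix, "Xid" events), so treat a clean result as a hint, not proof.

```python
import re

# Assumed patterns: NVIDIA kernel-driver messages usually carry an "NVRM:"
# prefix, and GPU faults are typically reported as "Xid" events.
GPU_LOG_PATTERN = re.compile(r"NVRM:|\bXid\b|fallen off the bus", re.IGNORECASE)


def find_gpu_messages(log_text):
    """Return the log lines that look like NVIDIA driver or GPU fault messages."""
    return [line for line in log_text.splitlines() if GPU_LOG_PATTERN.search(line)]
```

Feed it the text of the kernel log (for example the saved output of dmesg) and print whatever it returns. Reading the kernel log may itself require elevated privileges on some systems.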

The root vs non-root discrepancy could be a red herring; how many experiments did you run to establish a firm correlation? It could also be that running as root vs non-root causes the app to be run with different environment settings, so scrutinize those.
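One way to scrutinize the environment settings is to capture the output of env once as the regular user and once as root, then compare the two. The following is a rough Python sketch of that comparison; parse_env and diff_env are hypothetical helper names, and the parser assumes one NAME=value pair per line, so variables with multi-line values will not be handled correctly.

```python
def parse_env(dump):
    """Parse the output of `env` (one NAME=value per line) into a dict."""
    variables = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines that are not NAME=value
            variables[name] = value
    return variables


def diff_env(user_dump, root_dump):
    """Return {name: (user_value, root_value)} for variables that differ.

    A value of None means the variable is not set for that account.
    """
    user_env = parse_env(user_dump)
    root_env = parse_env(root_dump)
    differences = {}
    for name in sorted(set(user_env) | set(root_env)):
        user_value = user_env.get(name)
        root_value = root_env.get(name)
        if user_value != root_value:
            differences[name] = (user_value, root_value)
    return differences
```

Entries such as CUDA_VISIBLE_DEVICES, PATH and LD_LIBRARY_PATH would be the first ones to look at in the result.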

How many GTX 1080 Ti are in this system, and what kind of system is it? What’s the wattage of the power supply in the machine?


I think that it happens to all cards in the system.

The system is a SuperMicro SYS-4028GR-TR2 with 10 GTX 1080Ti.

I tried to run the simpleCUBLAS example under cuda-memcheck:
[phreh100@hilbert210 ~]$ cuda-memcheck /software/CUDA/9.0.176/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS
GPU Device 0: “GeForce GTX 1080 Ti” with compute capability 6.1
simpleCUBLAS test running…
!!! CUBLAS initialization error
========= Program hit cudaErrorUnknown (error 30) due to “unknown error” on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib64/ [0x32f753]
========= Host Frame:/software/CUDA/9.0.176/lib64/ [0x404190]
========= Host Frame:/software/CUDA/9.0.176/lib64/ (cublasCreate_v2 + 0x28) [0x78f98]
========= Host Frame:/software/CUDA/9.0.176/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS [0x3e05]
========= Host Frame:/lib64/ (__libc_start_main + 0xf5) [0x21b35]
========= Host Frame:/software/CUDA/9.0.176/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS [0x3429]

========= ERROR SUMMARY: 1 error

There is no other process running on any GPU. When I run the same command as root, everything works fine without any errors. There is also no message about a GPU having fallen off the bus.

Kind regards

The information I can find indicates that this system officially supports eight (not ten) GPUs, although it comes with 11 PCIe slots (interestingly enough all in a single PCIe root complex, presumably by using some sort of PCIe switch?)

Did you buy the system fully assembled with all the GPUs from a system integrator? If so, check with the system vendor on the issues observed. If you bought just the base system from SuperMicro, double check with them on the advisability of installing ten high-end GPUs.

If you installed all those GPUs after acquiring the base system, double check all connectors (PCIe, power) and upgrade to the latest system BIOS as due-diligence measures. Confirm that you did not use any Y-splitters or 6-pin to 8-pin converters in the PCIe power supply cables to the GPUs.

The system description says the PSU delivers 2000W (high-quality 80 PLUS Titanium, nice!). Is that total power? Or is the total power supply 4000W (it says x2, but it is not clear whether that means 2x1000W or 2x2000W)? Each GTX 1080Ti is specified for 250W, so ten of those would require 2500W, plus about 250W for the CPUs / system memory / peripherals, for 2750W in total. PSUs should not be loaded much above 60% if you desire rock-solid operation, so we are looking at a desired power supply of about 4580W (2750W / 0.6), with 4000W being borderline if that is what you actually have.
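The power budget above as a small Python sketch, in case you want to redo it for other configurations. The 250W per GPU, the roughly 250W for the rest of the system and the 60% load target are the figures from this post; the function names are just illustrative.

```python
def total_load_watts(gpu_count, watts_per_gpu=250.0, system_watts=250.0):
    """Estimated full-load draw: all GPUs plus CPUs, memory and peripherals."""
    return gpu_count * watts_per_gpu + system_watts


def required_psu_watts(gpu_count, max_load=0.60, **kwargs):
    """PSU capacity needed so that the full load stays at or below max_load."""
    return total_load_watts(gpu_count, **kwargs) / max_load


def load_fraction(psu_watts, gpu_count, **kwargs):
    """Fraction of the PSU capacity used at full load."""
    return total_load_watts(gpu_count, **kwargs) / psu_watts


print(round(required_psu_watts(10)))       # ten GPUs: about 4583 W
print(round(load_fraction(4000, 10), 4))   # 4000 W supply: 0.6875, i.e. ~69% load
```

With eight cards the same calculation gives a 2250W load, which fits under the 60% target on a 4000W supply.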

Have you tried running with just eight GTX 1080 Ti cards to see whether this makes the problems go away?