Dual GeForce GTX or Titan V on mobo, unable to display upon launching Ubuntu 18.04

Hi :

I have to put the above on hold,
and I am now installing the NVIDIA driver for a Tesla K40m.

So I went here:
https://www.nvidia.com/Download/index.aspx
Then selected the appropriate options to
get this:

Tesla Driver for Linux RHEL 7

Version: 418.67
Release Date: 2019.5.7
Operating System: Linux 64-bit RHEL7
CUDA Toolkit: 10.1
Language: English (US)
File Size: 154.4 MB

https://www.nvidia.com/Download/driverResults.aspx/146673/en-us

Downloaded the rpm, then did this install:

i)   rpm -i nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm
ii)  yum clean all
iii) yum install cuda-drivers
iv)  reboot

But the nvidia-bug-report script results
indicate many problems with initialization.

I was able to run nvidia-smi and get its display,
but not able to run the nvidia-settings command.

Attached is the nvidia-bug-report.log.gz
Please advise.

nvidia-bug-report.log.gz (1.2 MB)

According to the latest dmesg, the nvidia driver loads fine. nvidia-settings doesn’t work because the Xserver is running on the integrated ASPEED server graphics; this is expected. nvidia-settings is a tool for graphics settings, not useful for Teslas.
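If you want to double-check which device the Xserver is actually running on, here is a minimal sketch (the device names in the grep are assumptions, adjust to what lspci reports on your box):

# List display-class devices and the kernel driver bound to each
lspci -nnk | grep -iA3 vga
# The Xorg log should reference the ASPEED/ast device rather than the Tesla
grep -iE "ast|aspeed|nvidia" /var/log/Xorg.0.log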

The bug report is quite cryptic to me.
Are there any support docs to help interpret the errors?

I have another problem here:

Thanks.

Stick with me please, gentlemen.

The best way to install TensorFlow is as a Docker image rather than system-wide… anyway.
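For example, a minimal sketch of the Docker route (assuming Docker 19.03+ with the NVIDIA container toolkit installed; the image tags are just examples):

# Sanity check that Docker can see the GPUs (nvidia-smi is injected by the NVIDIA runtime)
docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi
# Then run a GPU-enabled TensorFlow container
docker run --gpus all --rm -it tensorflow/tensorflow:latest-gpu bash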

So how does your above issue relate…
to the install error bug [executing grub-install /dev/sda failed 18.04 ]?

Both issues always occur on AMD boards and come down to a misinterpretation of hardware scanning.

I’m repeating what I read while setting up gpu passthrough for a friend a while ago.

I believe it points in the right direction.

Linux scans hardware addresses and ports in the reverse order from Windows:
SATA ports, PCIe ports, USB ports.

Below is a mind rant I just had on Reddit on this issue.

Installing the driver the “Linux” way with the nvidia installer overrides the OS installer Calamares, which normally translates or compensates for this. If you don’t install nvidia at OS install time, it installs OK.

Back to the scanning.
When you had a black screen on your primary card, if you plugged into the other card it should work.
Linux scans the PCIe slots from the “bottom” of a desktop motherboard, whereas Windows scans from the top PCIe slot down.

Regardless of nvidia-driver-XXX or nvidia-driver-YYY,
investigate the GPU-passthrough forums.
Actually, it’s on the Manjaro forum from a couple of months ago.
Same issue as yours.

I have concerns about:

“I have two GeForce GTX 1080 and two Titan V, all have 12 GB each, and MOBO is AMD X399. OS is Ubuntu 18.04.”

Is your gear actually fit for purpose?
Do you use SLI with this setup? There’s no graphics here, just compute, hey?

Hope that helps.
I gotta go to bed.

Hi generix:

I’m back to work on my quad GPUs now…
Did a Linux Mint update and it looks like
it also installed new NVIDIA drivers:
~$ nvidia-smi
Wed Aug 7 14:30:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |

Question:
Is the above NVIDIA driver and CUDA version all good, OR …
should I repeat your suggestions, i.e. install the NVIDIA driver from the PPA
and then do this: “sudo apt install cuda-toolkit-10-0”?

Thanks.

The cuda version displayed just means “cuda toolkit up to cuda 10.1 supported with this driver”
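If you want to see which toolkit is actually installed (as opposed to the maximum version the driver supports), a quick sketch (assuming the default /usr/local install location):

# Version of the installed CUDA toolkit, if any
nvcc --version
# Or list what is present under /usr/local
ls -d /usr/local/cuda*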

Thank you.

Correct, not much graphics, just compute!
I’m only using it for deep learning.

I would like to use SLI on the two GeForce GTX cards for display,
since they appear to have SLI connectors,
but it looks like it is complicated to set up on Linux Mint!
When I look here:

for the MSI X399 mobo, I did not see GeForce GTX in the GPU boards column.

Why?

So does this mean I can run this command:

sudo apt install cuda-toolkit-10-1

instead of this:

sudo apt install cuda-toolkit-10-0

This depends on which toolkit version you want to install/your application needs.

Hi generix:

I installed CUDA 10.1.
When I ran deviceQuery from the compiled samples here:

$:~/NVIDIA_CUDA-10.1_Samples/1_Utilities/deviceQuery$ ./deviceQuery
Excerpts:
" …

Peer access from TITAN V (GPU0) → TITAN V (GPU1) : Yes
Peer access from TITAN V (GPU0) → GeForce GTX 1080 Ti (GPU2) : No
Peer access from TITAN V (GPU0) → GeForce GTX 1080 Ti (GPU3) : No
Peer access from TITAN V (GPU1) → TITAN V (GPU0) : Yes
Peer access from TITAN V (GPU1) → GeForce GTX 1080 Ti (GPU2) : No
Peer access from TITAN V (GPU1) → GeForce GTX 1080 Ti (GPU3) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → TITAN V (GPU0) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → TITAN V (GPU1) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → GeForce GTX 1080 Ti (GPU3) : Yes
Peer access from GeForce GTX 1080 Ti (GPU3) → TITAN V (GPU0) : No
Peer access from GeForce GTX 1080 Ti (GPU3) → TITAN V (GPU1) : No
Peer access from GeForce GTX 1080 Ti (GPU3) → GeForce GTX 1080 Ti (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 4
Result = PASS "

From the above, it appears one GPU can access the other as long as they are of the same model.
What does this peer-access feature allow us to do, and with what tools (TensorFlow?)?

Thank you.

Which GPUs can communicate peer-to-peer does not depend on the model but on the PCI bus layout of your mainboard, i.e. the slots the cards are in.
AFAIK, TensorFlow will make use of p2p transfers if available. Using an SLI bridge on the two GTX cards will increase p2p transfer speeds. Forget about using SLI for graphics/display; it’s broken and useless on Linux.
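If you want to see why the two pairs can or cannot reach each other, you can dump the PCIe topology nvidia-smi sees (a minimal sketch; the legend printed with the matrix explains which links cross a host bridge):

# Show the GPU/PCIe topology matrix, including which pairs share a PCIe switch or host bridge
nvidia-smi topo -m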

Hi generix:

OS is Linux Mint 19.x

Result of

$ nvidia-smi -l 1

is this:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 1407 G /usr/lib/xorg/Xorg 113MiB |
| 3 1959 G cinnamon 45MiB |
+-----------------------------------------------------------------------------+
Thu Aug 29 11:03:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… Off | 00000000:0B:00.0 Off | N/A |
| 0% 28C P8 11W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108… Off | 00000000:0C:00.0 Off | N/A |
| 51% 31C P8 8W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
[ COMMENT: Why is the Titan V card below not consuming any power? ]
| 2 TITAN V Off | 00000000:42:00.0 Off | N/A |
| 32% 48C P8 N/A / N/A | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V Off | 00000000:43:00.0 On | N/A |
| 38% 55C P2 31W / 250W | 160MiB / 12065MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

For the 2nd Titan V card (see “COMMENT” above), I noticed there is no power
being consumed at all. I thought the GPU was not getting power, but
it seems nvidia-smi was able to read the temperature on that Titan V card!

I also swapped the two Titan V cards between their PCIe slots and still got the same result
in the second PCIe slot (similar output to the above).
This means both Titan V cards are working!
Then I swapped the power cables, and still got the same result!

However, for the two GeForce GTX GPUs, it shows both are consuming some power
and also using up some memory (even though I did not run anything on them).

NOTE: Only the first Titan V card (Bus ID = 00000000:43:00.0, Display = On) is being used!

Why is this “zero power consumption” (i.e. “N/A / N/A”)
happening to the Titan V card in the 2nd PCIe slot
and not to the GeForce GTX GPUs?

Is this a problem and if so, what do I need to do?

Is there an NVIDIA utility I can run to completely test all GPUs on the motherboard,
so as to ensure they are all working?

Thanks.

This might or might not be a symptom of something serious. First make sure that nvidia-persistenced is running, as failing to do so can lead to all kinds of odd effects.
This might also be just some driver bug, IIRC this happened before.
For a reliable test, use gpu-burn.
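A minimal sketch of both checks (assuming a systemd setup and the commonly used gpu-burn tool from github.com/wilicc/gpu-burn; the build target name may differ):

# Check that the persistence daemon is running
systemctl status nvidia-persistenced

# Build and run gpu-burn to stress all detected GPUs for 5 minutes
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 300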

Hi generix:

~$ ps aux | grep nvidia-persistenced
ml1 1286 0.0 0.0 17324 1668 ? Ss 08:32 0:00 /usr/bin/nvidia-persistenced --user ml1

nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 1423 G /usr/lib/xorg/Xorg 127MiB |
| 3 1997 G cinnamon 46MiB |
+-----------------------------------------------------------------------------+
Fri Aug 30 09:11:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… On | 00000000:0B:00.0 Off | N/A |
| 0% 27C P8 10W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108… On | 00000000:0C:00.0 Off | N/A |
| 51% 31C P8 8W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
[ COMMENT: This Titan V is now On, but still no power consumption! ]
| 2 TITAN V On | 00000000:42:00.0 Off | N/A |
| 34% 49C P8 N/A / N/A | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V On | 00000000:43:00.0 On | N/A |
| 38% 57C P2 41W / 250W | 175MiB / 12065MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Recall I mentioned I swapped the two Titan V cards just to make sure they work, and
both Titan V cards work on PCIe bus ID 00000000:43:00.0 (i.e. GPU slot #3),
since it was the ONLY display PCIe slot.

GPU slot #2 is now On (as shown by nvidia-smi above)
after enabling nvidia-persistenced, but nvidia-smi still does not
display any power consumption (Pwr:Usage/Cap = N/A / N/A)
for the Titan V GPU in the PCIe slot with bus ID 00000000:42:00.0.

What does this nvidia-smi result mean, and what do I have to do next?

Thank you.

Probably some subtle driver bug. What’s the output of

cat /sys/bus/pci/devices/0000\:42\:00.0/power/control

?
Does it display any power usage when under load?
Otherwise, you could check if it’s a regression by installing the 418 driver or earlier (kernel <5.0).
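One way to watch power draw while the card is loaded (a sketch; these are standard nvidia-smi query fields):

# Poll power draw and power limit for all GPUs once per second
nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv -l 1

# To test for a regression, roll back to the 418 series (package name assumes the graphics-drivers PPA)
sudo apt install nvidia-driver-418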

I ran two GPUs under Ubuntu 20.04 LTS, a GTX 1060 and a GTX 1080 Ti, using the latest version of the graphics driver that ships with Ubuntu, and they work fine. I mainly run DaVinci Resolve Studio 16.2 for video color grading and editing.