Dual GeForce GTX or Titan V on mobo, unable to display upon launching Ubuntu 18.04

Hi :

I have to put the above on hold,
and I am now installing the NVIDIA driver for a Tesla K40m.

So I went here:
https://www.nvidia.com/Download/index.aspx
Then selected the appropriate options to
get this:

Tesla Driver for Linux RHEL 7

Version: 418.67
Release Date: 2019.5.7
Operating System: Linux 64-bit RHEL7
CUDA Toolkit: 10.1
Language: English (US)
File Size: 154.4 MB

https://www.nvidia.com/Download/driverResults.aspx/146673/en-us

Downloaded the rpm, then did this install:

i)   rpm -i nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm
ii)  yum clean all
iii) yum install cuda-drivers
iv)  reboot

But the nvidia-bug-report script results
indicate many problems with initialization.

I was able to run nvidia-smi and get its display,
but not able to run the nvidia-settings command.

Attached is the nvidia-bug-report.log.gz
Please advise.

nvidia-bug-report.log.gz (1.2 MB)

According to the latest dmesg, the nvidia driver loads fine. nvidia-settings doesn’t work because the Xserver is running on the integrated ASPEED server graphics; this is expected. nvidia-settings is a tool for graphics settings, not useful for Teslas.
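If you want to double-check which device the Xserver is actually running on, here is a minimal sketch (the device names in the grep are assumptions, adjust to what lspci reports on your box):

# List display-class devices and the kernel driver bound to each
lspci -nnk | grep -iA3 vga
# The Xorg log should reference the ASPEED/ast device rather than the Tesla
grep -iE "ast|aspeed|nvidia" /var/log/Xorg.0.log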

The bug report is quite cryptic to me.
Are there any support docs to help interpret the errors?

I have another problem here:

Thanks.

Stick with me please, gentlemen.

The best way to install TensorFlow is as a Docker image rather than system-wide… anyway.
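For example, a minimal sketch of the Docker route (assuming Docker 19.03+ with the NVIDIA container toolkit installed; the image tags are just examples):

# Sanity check that Docker can see the GPUs (nvidia-smi is injected by the NVIDIA runtime)
docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi
# Then run a GPU-enabled TensorFlow container
docker run --gpus all --rm -it tensorflow/tensorflow:latest-gpu bash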

So how does your above issue relate…
to the install error bug [executing grub-install /dev/sda failed 18.04 ]?

Both issues always occur on AMD boards and come down to a misinterpretation of hardware scanning.

I’m repeating what I read while setting up gpu passthrough for a friend a while ago.

I believe it points in the right direction.

Linux scans hardware addresses and ports in the reverse order from Windows:
SATA ports, PCIe ports, USB ports.

Below is a mind rant I just had on Reddit on this issue.

Installing the driver the “Linux” way with the nvidia installer overrides the OS installer Calamares, which normally translates or compensates for this. If you don’t install nvidia at OS install time, it installs OK.

Back to the scanning.
When you had a black screen on your primary card, if you plugged into the other card it should work.
Linux scans the PCIe slots from the “bottom” of a desktop motherboard, whereas Windows scans from the top PCIe slot down.

Regardless of nvidia-driver-XXX or nvidia-driver-YYY,
investigate the GPU-passthrough forums.
Actually, it’s on the Manjaro forum from a couple of months ago.
Same issue as yours.

I have concerns about:

“I have two GeForce GTX 1080 and two Titan V, all have 12 GB each, and MOBO is AMD X399. OS is Ubuntu 18.04.”

Is your gear actually fit for purpose?
Do you use SLI with this setup? There’s no graphics here, just compute, hey?

Hope that helps.
I gotta go to bed.

Hi generix:

I’m back to work on my quad GPUs now…
Did a Linux Mint update and it looks like
it also installed new NVIDIA drivers:
~$ nvidia-smi
Wed Aug 7 14:30:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |

Question:
Is the above NVIDIA driver and CUDA version all good, OR …
should I repeat your suggestions, i.e. install the NVIDIA driver from the PPA
and then do this: “sudo apt install cuda-toolkit-10-0”?

Thanks.

The cuda version displayed just means “cuda toolkit up to cuda 10.1 supported with this driver”
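If you want to see which toolkit is actually installed (as opposed to the maximum version the driver supports), a quick sketch (assuming the default /usr/local install location):

# Version of the installed CUDA toolkit, if any
nvcc --version
# Or list what is present under /usr/local
ls -d /usr/local/cuda*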

Thank you.

Correct, not much graphics, just compute!
I’m only using it for deep learning.

I would like to use SLI on the two GeForce GTX cards for display,
since they appear to have SLI connectors,
but it looks like it is complicated to set up on Linux Mint!
When I look here:

for the MSI X399 mobo, I did not see GeForce GTX in the GPU boards column.

Why?

So does this mean I can run this command:

sudo apt install cuda-toolkit-10-1

instead of this:

sudo apt install cuda-toolkit-10-0

This depends on which toolkit version you want to install/your application needs.

Hi generix:

I installed CUDA 10.1.
When I ran deviceQuery from the compiled samples here:

$:~/NVIDIA_CUDA-10.1_Samples/1_Utilities/deviceQuery$ ./deviceQuery
Excerpts:
" …

Peer access from TITAN V (GPU0) → TITAN V (GPU1) : Yes
Peer access from TITAN V (GPU0) → GeForce GTX 1080 Ti (GPU2) : No
Peer access from TITAN V (GPU0) → GeForce GTX 1080 Ti (GPU3) : No
Peer access from TITAN V (GPU1) → TITAN V (GPU0) : Yes
Peer access from TITAN V (GPU1) → GeForce GTX 1080 Ti (GPU2) : No
Peer access from TITAN V (GPU1) → GeForce GTX 1080 Ti (GPU3) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → TITAN V (GPU0) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → TITAN V (GPU1) : No
Peer access from GeForce GTX 1080 Ti (GPU2) → GeForce GTX 1080 Ti (GPU3) : Yes
Peer access from GeForce GTX 1080 Ti (GPU3) → TITAN V (GPU0) : No
Peer access from GeForce GTX 1080 Ti (GPU3) → TITAN V (GPU1) : No
Peer access from GeForce GTX 1080 Ti (GPU3) → GeForce GTX 1080 Ti (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 4
Result = PASS "

From the above, it appears one GPU can access the other as long as they are of the same model.
What does this peer-access feature allow us to do, and with what tools (TensorFlow?)?

Thank you.

Which GPUs can communicate peer-to-peer does not depend on the model but on the PCI bus layout of your mainboard, i.e. the slots the cards are in.
AFAIK, TensorFlow will make use of p2p transfers if available. Using an SLI bridge on the two GTX cards will increase p2p transfer speeds. Forget about using SLI for graphics/display; it’s broken and useless on Linux.
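If you want to see why the two pairs can or cannot reach each other, you can dump the PCIe topology nvidia-smi sees (a minimal sketch; the legend printed with the matrix explains which links cross a host bridge):

# Show the GPU/PCIe topology matrix, including which pairs share a PCIe switch or host bridge
nvidia-smi topo -m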

Hi generix:

OS is Linux Mint 19.x

Result of

$ nvidia-smi -l 1

is this:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 1407 G /usr/lib/xorg/Xorg 113MiB |
| 3 1959 G cinnamon 45MiB |
+-----------------------------------------------------------------------------+
Thu Aug 29 11:03:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… Off | 00000000:0B:00.0 Off | N/A |
| 0% 28C P8 11W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108… Off | 00000000:0C:00.0 Off | N/A |
| 51% 31C P8 8W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
[ COMMENT: Why is the Titan V card below not consuming any power? ]
| 2 TITAN V Off | 00000000:42:00.0 Off | N/A |
| 32% 48C P8 N/A / N/A | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V Off | 00000000:43:00.0 On | N/A |
| 38% 55C P2 31W / 250W | 160MiB / 12065MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

For the 2nd Titan V card (see “COMMENT” above), I noticed there is no power
being consumed at all. I thought the GPU was not getting power, but
it seems nvidia-smi was able to read the temperature on that Titan V card!

I also swapped the two Titan V cards between their PCIe slots and still got the same result
in the second PCIe slot (similar output to the above).
This means both Titan V cards are working!
Then I swapped the power cables, and still got the same result!

However, for the two GeForce GTX GPUs, it shows both are consuming some power
and also using up some memory (even though I did not run anything on them).

NOTE: Only the first Titan V card (Bus ID = 00000000:43:00.0, Display = On) is being used!

Why is this “zero power consumption” (i.e. “N/A / N/A”)
happening to the Titan V card in the 2nd PCIe slot
and not to the GeForce GTX GPUs?

Is this a problem and if so, what do I need to do?

Is there an NVIDIA utility I can run to completely test all GPUs on the motherboard,
so as to ensure they are all working?

Thanks.

This might or might not be a symptom of something serious. First make sure that nvidia-persistenced is running, as failing to do so can lead to all kinds of odd effects.
This might also be just some driver bug, IIRC this happened before.
For a reliable test, use gpu-burn.
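A minimal sketch of both checks (assuming a systemd setup and the commonly used gpu-burn tool from github.com/wilicc/gpu-burn; the build target name may differ):

# Check that the persistence daemon is running
systemctl status nvidia-persistenced

# Build and run gpu-burn to stress all detected GPUs for 5 minutes
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 300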

Hi generix:

~$ ps aux | grep nvidia-persistenced
ml1 1286 0.0 0.0 17324 1668 ? Ss 08:32 0:00 /usr/bin/nvidia-persistenced --user ml1

nvidia-smi -l 1
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 1423 G /usr/lib/xorg/Xorg 127MiB |
| 3 1997 G cinnamon 46MiB |
+-----------------------------------------------------------------------------+
Fri Aug 30 09:11:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… On | 00000000:0B:00.0 Off | N/A |
| 0% 27C P8 10W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108… On | 00000000:0C:00.0 Off | N/A |
| 51% 31C P8 8W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
[ COMMENT: This Titan V is now On, but still no power consumption! ]
| 2 TITAN V On | 00000000:42:00.0 Off | N/A |
| 34% 49C P8 N/A / N/A | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V On | 00000000:43:00.0 On | N/A |
| 38% 57C P2 41W / 250W | 175MiB / 12065MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Recall I mentioned I swapped the two Titan V cards just to make sure they work, and
both Titan V cards work on PCIe bus ID 00000000:43:00.0 (i.e. GPU slot #3),
since it was the ONLY display PCIe slot.

GPU slot #2 is now On (as shown by nvidia-smi above)
after enabling nvidia-persistenced, but nvidia-smi still does not
display any power consumption (Pwr:Usage/Cap = N/A / N/A)
for the Titan V GPU in the PCIe slot with bus ID 00000000:42:00.0.

What does this nvidia-smi result mean, and what do I have to do next?

Thank you.

Probably some subtle driver bug. What’s the output of

cat /sys/bus/pci/devices/0000\:42\:00.0/power/control

?
Does it display any power usage when under load?
Otherwise, you could check if it’s a regression by installing the 418 driver or earlier (kernel <5.0).
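One way to watch power draw while the card is loaded (a sketch; these are standard nvidia-smi query fields):

# Poll power draw and power limit for all GPUs once per second
nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv -l 1

# To test for a regression, roll back to the 418 series (package name assumes the graphics-drivers PPA)
sudo apt install nvidia-driver-418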

I ran two GPUs under Ubuntu 20.04 LTS, a GTX 1060 and a GTX 1080 Ti, using the latest version of the graphics driver that ships with Ubuntu, and they work fine. I mainly run DaVinci Resolve Studio 16.2 for video color grading and editing.