About to start installing CUDA -- need to clarify a few questions

I don’t want to mess up the installation and then have to format the drive, so I figured I’d ask a few questions to make sure I know what I’ll be doing.

  1. This is a clean install on CentOS 7.6. There are no video card drivers currently installed. Do I need to install the drivers first and then install CUDA, or will the CUDA installation install the GPU drivers automatically?

  2. If it does, will the CUDA installation also install nvidia-smi?

  3. I was thinking of doing a package manager installation. For CentOS, the instructions start with “Satisfy DKMS dependency”, but what exactly needs to be done to satisfy it is not clear to me.

  4. In step 4 here https://docs.nvidia.com/cuda/archive/10.1/cuda-installation-guide-linux/index.html#redhat-installation what do I use in place of the three placeholders in the command?

  5. Do I actually have to download CUDA Toolkit? I don’t see where it comes into the installation – as far as I can understand the package manager installation procedure installs from an online repository. Yes?

Thank you.

  1. Yes, the CUDA installer also installs the GPU driver. Get your installers from http://www.nvidia.com/getcuda
  2. Yes
  3. You should add the epel repository. If you google “centos add epel repository” you will find various instructions. The package manager install process will then pull in DKMS if needed.
  4. Look at the link I provided in item 1 above. You do have to actually download something, and this question is basically answered by the name of the file you download.
  5. Yes, you actually have to download something. You can download either a full package set or a meta package set that uses the network to fetch most of the packages needed. This is the difference between the “local” and “network” options for the package manager install. Either way you have to download something; follow the instructions there.
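
For item 4, here is a rough sketch of how the three fields can be read off a downloaded file name. It assumes the guide's template is cuda-repo-<distro>-<version>.<architecture>.rpm; the function name parse_cuda_repo_rpm is made up for illustration, and the example file name is the 10.1 local installer for RHEL/CentOS 7.

```shell
#!/bin/sh
# Split a CUDA repo package file name into its three fields.
# Assumed template: cuda-repo-<distro>-<version>.<architecture>.rpm
# parse_cuda_repo_rpm is a hypothetical helper, not part of any NVIDIA tool.
parse_cuda_repo_rpm() {
    name=${1##*/}              # drop any leading directory
    name=${name%.rpm}          # drop the .rpm suffix
    name=${name#cuda-repo-}    # drop the fixed prefix
    arch=${name##*.}           # text after the last dot
    rest=${name%.*}            # everything before the architecture
    distro=${rest%%-*}         # text before the first dash
    version=${rest#*-}         # everything after the first dash
    printf 'distro=%s\nversion=%s\narchitecture=%s\n' "$distro" "$version" "$arch"
}

parse_cuda_repo_rpm cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
```

For that file name the sketch prints distro=rhel7, version=10-1-local-10.1.168-418.67-1.0-1 and architecture=x86_64, so in practice you can simply pass the downloaded file name to rpm as it is.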

Thank you. By the way, the file I downloaded has “-local-10.1.168-418.67-” in its name, but there’s no checksum for a file like this on this list: https://developer.download.nvidia.com/compute/cuda/10.1/Prod/docs/sidebar/md5sum.txt I’m proceeding without checking the sum; just letting you know in case you want to update the list.

Not sure where you got that link from.

Use this:

https://developer.download.nvidia.com/compute/cuda/10.1/Prod/docs2/sidebar/md5sum.txt

It is linked from the current download page.

https://docs.nvidia.com/cuda/archive/10.1/cuda-installation-guide-linux/index.html#download-nvidia-driver-and-cuda-software points to the checksums for version 9.2: https://developer.download.nvidia.com/compute/cuda/9.2/Prod/docs/sidebar/md5sum.txt What I did was change 9.2 to 10.1 in the URL.

I guess you shouldn’t do that. I guess you should use the links provided on the download page.

There are (were) multiple versions of CUDA 10.1 released. The one that is currently available readily (without going to the archive page) is CUDA 10.1U1 (CUDA 10.1, update 1) and it has version 10.1.168, whereas the previous 10.1 release had version 10.1.105, and they also had different drivers bundled with their respective installers.

OK, but the point is that the documentation for version 10.1 points to the checksums for version 9.2; just look at the links I gave. This is how I initially ended up with the wrong list, which I then tried to fix by changing 9.2 to 10.1 in the URL.

Yes, that looks messed up.

Use the checksum link that is linked from the download page(s).
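
With the right list saved locally, the check itself might look like the sketch below. It assumes the list uses the usual md5sum layout (hash, whitespace, file name) and that md5sum and awk are installed; verify_md5 is a hypothetical helper name.

```shell
#!/bin/sh
# verify_md5 FILE LIST
# Returns 0 if LIST has an entry for FILE's name and the MD5 matches,
# 1 if the entry exists but the hash differs,
# 2 if LIST has no entry for that file name at all.
# Assumes LIST lines look like "<md5 hash>  <file name>".
verify_md5() {
    file=$1
    list=$2
    base=${file##*/}    # compare by file name only, not by path
    expected=$(awk -v f="$base" '$2 == f { print $1; exit }' "$list")
    if [ -z "$expected" ]; then
        echo "no checksum listed for $base" >&2
        return 2
    fi
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ "$expected" = "$actual" ]
}

# Example (run where both files were downloaded):
#   verify_md5 cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm md5sum.txt && echo OK
```

The separate exit code for a missing entry is a deliberate choice: it tells apart "wrong list" (the situation described above) from "corrupt download".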

Got it. The checksum checks out for me. Now nvidia-smi freezes though and only ctrl-alt-del seems to help, but that’s a separate issue…

nvidia-smi can take a long time to run on some system setups.
I would recommend doing a system reboot if you haven’t done one after the install.
Also, you may want to set persistence mode on your GPUs, as root:

nvidia-smi -pm 1

This usually speeds up subsequent nvidia-smi calls.
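
If you want that in a script, a small wrapper could look like this. It is only a sketch: enable_persistence is a made-up name, and the optional argument exists purely so the function can be tried on a machine without a GPU.

```shell
#!/bin/sh
# enable_persistence [CMD]
# Turns on persistence mode by running "CMD -pm 1".
# CMD defaults to nvidia-smi. Returns 127 if CMD is not on the PATH
# and 1 if the caller is not root.
enable_persistence() {
    smi=${1:-nvidia-smi}
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "$smi not found; is the driver installed?" >&2
        return 127
    fi
    if [ "$(id -u)" -ne 0 ]; then
        echo "persistence mode has to be set as root" >&2
        return 1
    fi
    "$smi" -pm 1
}

# Example (as root):
#   enable_persistence
```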

Turns out nvidia-smi didn’t actually freeze. Whenever I run nvidia-smi or nvidia-smi -dmon or nvidia-smi -pm 1 or ./deviceQuery (in the samples binaries folder), the screen no longer scrolls when the output overflows it, and the cursor ends up somewhere below the last visible line. “clear” doesn’t clear the screen. But the computer isn’t frozen: if I type “reboot” and press Enter, it reboots.

This is a CentOS minimal install (no GUI). Perhaps if I installed GNOME this would cease to be a problem. But what if this is an indication that I somehow messed up the CUDA installation, or that something’s wrong with the GPU? The samples compiled successfully, by the way.

GPU is Nvidia 2080.

Sounds like the driver install is messed up. I can’t diagnose it completely from what you’ve described.

Did you use the package manager install method? Do you have a 3.10.xxxxx kernel on your CentOS 7.6?

Did you use the package manager install method?

Yes.

Do you have a 3.10.xxxxx kernel on your CentOS 7.6?

It’s CentOS 7.6.1810, wiki says kernel version should be 3.10.0-957.
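
The wiki lists the kernel a release shipped with, but a later update can install a newer one, so it may be worth checking the kernel that is actually running. A minimal sketch; the helper name is_310_kernel is hypothetical:

```shell
#!/bin/sh
# Succeeds when the given kernel release string belongs to the 3.10 series.
# is_310_kernel is a hypothetical helper used only for this check.
is_310_kernel() {
    case $1 in
        3.10.*) return 0 ;;
        *)      return 1 ;;
    esac
}

release=$(uname -r)    # release string of the kernel that is running now
if is_310_kernel "$release"; then
    echo "running kernel $release is a 3.10 kernel"
else
    echo "running kernel $release is not a 3.10 kernel"
fi
```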

I have another Nvidia 2080 and a 5-year-old Titan that has been through a lot computationally. I’ll try them tomorrow just to rule out a problem with the GPU.

Other than that, what would my options be? I realize I will have to uninstall the current installation following these steps: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation But what then? Try a package manager installation of version 10.0 instead of 10.1? Or go for the runfile installation? Would I lose anything computationally if I used the runfile instead of the package manager? Thanks.

Did you follow these steps:

sudo rpm -i cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.x86_64.rpm
sudo yum clean all
sudo yum install cuda

You might also just try a yum remove cuda / yum install cuda sequence to see if anything changes.

After that I would probably try the runfile install method first. Yes, you have to fully remove the package manager install. You don’t “lose” any CUDA functionality with runfile install.

Did you follow these steps:

Yes, but instead of sudo yum clean all I used sudo yum clean expire-cache; the latter is what the full installation guide uses.

So I tried the Titan instead of the 2080 and there’s no problem. The screen scrolls and clears, and the cursor is always visible. The deviceQuery test passes. One thing that could be important is that I used a different display with the Titan, connected via DVI. With the 2080 I used HDMI. The 2080 unfortunately doesn’t have a DVI port, but it has DisplayPort. I’ll try to find a display with a DisplayPort input or a DVI-to-DisplayPort adapter. Meanwhile, any ideas what’s going on and what else could be tried?

UPDATE: I’ve now tried the following setup: both Titan and 2080 are installed, display is hooked up to Titan via DVI. nvidia-smi sees both cards. It behaves normally – the screen scrolls down and clears, the cursor is always visible. deviceQuery passes for both cards. To sum up, it’s only when 2080 is the primary card connected to a display that I have the problem. Whether HDMI’s got anything to do with it is an open question as I’m still looking for a DisplayPort-adapter.