Cannot install CUDA

Hi, this is extremely frustrating. I want to install CUDA 12.1 and nvidia-driver-545.

  • Why CUDA 12.1? Because my PyTorch build (2.2.1+cu121) is compiled against 12.1, and many models depend on nvcc-built libraries (apex, xformers, etc.) that want a specific CUDA minor version

  • Why 545 and not the 530 that comes with 12.1? Because 530 doesn’t compile. The DKMS build of the kernel module fails with:

/var/lib/dkms/nvidia/530.30.02/build/common/inc/nv-mm.h:88:16: error: too many arguments to function ‘get_user_pages’
   88 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                ^~~~~~~~~~~~~~
./include/linux/mm.h:2431:59: note: expected ‘struct page **’ but argument is of type ‘long unsigned int’
 2431 |                     unsigned int gup_flags, struct page **pages);
      |                                             ~~~~~~~~~~~~~~^~~~~
/var/lib/dkms/nvidia/530.30.02/build/common/inc/nv-mm.h:88:16: error: too many arguments to function ‘get_user_pages’
   88 |         return get_user_pages(current, current->mm, start, nr_pages, write,
      |                ^~~~~~~~~~~~~~
./include/linux/mm.h:2430:6: note: declared here

How can I install this combination?

Install 545 using whatever method works for you. Then install CUDA 12.1 without installing the driver. If you use the runfile method, you can deselect the driver install. If you use the package manager method, then instead of doing

sudo apt install cuda

you would do:

sudo apt install cuda-toolkit-12-1

You can find additional info in the CUDA Linux installation guide.

Thanks @Robert_Crovella . However:

If I install 545 using the runfile, the apt version of CUDA is unhappy: it has a package dependency on the driver and wants to install the deb 530 over the runfile 545.

If I apt-get install nvidia-driver-545 from the cuda-12-3 local repo and then install cuda-toolkit-12-1 from the cuda-12-1 local repo (with both local repos installed at the same time), it installs and PyTorch works, but then I can’t install nvcc for the other things I need.

Attempting to install nvcc results in this mess. It wants to uninstall nvidia-driver-545:

$ sudo apt install cuda-toolkit-12-1
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cuda-toolkit-12-1 is already the newest version (12.1.0-1).
The following packages were automatically installed and are no longer required:
  nvidia-firmware-545-545.29.06 nvidia-modprobe
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.

$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

$ sudo apt install nvidia-cuda-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libnvidia-cfg1-545 libnvidia-common-545 libnvidia-extra-545 libnvidia-fbc1-545 libpkgconf3 nvidia-dkms-545 nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-modprobe
  nvidia-prime nvidia-settings pkg-config pkgconf pkgconf-bin screen-resolution-extra xserver-xorg-video-nvidia-545
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  g++-12 libaccinj64-12.0 libcu++-dev libcub-dev libcublas12 libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11
  libcusolvermg11 libcusparse12 libgl-dev libglx-dev libhwloc-plugins libhwloc15 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12
  libnppitc12 libnpps12 libnvblas12 libnvidia-compute-525 libnvidia-ml-dev libnvjitlink12 libnvjpeg12 libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libstdc++-12-dev libtbb-dev
  libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libvdpau-dev node-html5shiv nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit-doc nvidia-opencl-dev nvidia-profiler
  nvidia-visual-profiler ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless
Suggested packages:
  g++-12-multilib gcc-12-doc libhwloc-contrib-plugins libstdc++-12-doc libtbb-doc libvdpau-doc nodejs nvidia-cuda-samples opencl-clhpp-headers-doc fonts-nanum fonts-ipafont-gothic
  fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
Recommended packages:
  libnvcuvid1 nsight-compute nsight-systems
The following packages will be REMOVED:
  libnvidia-compute-545 libnvidia-decode-545 libnvidia-encode-545 libnvidia-gl-545 nvidia-compute-utils-545 nvidia-driver-545 nvidia-utils-545
The following NEW packages will be installed:
  g++-12 libaccinj64-12.0 libcu++-dev libcub-dev libcublas12 libcublaslt12 libcudart12 libcufft11 libcufftw11 libcuinj64-12.0 libcupti-dev libcupti-doc libcupti12 libcurand10 libcusolver11
  libcusolvermg11 libcusparse12 libgl-dev libglx-dev libhwloc-plugins libhwloc15 libnppc12 libnppial12 libnppicc12 libnppidei12 libnppif12 libnppig12 libnppim12 libnppist12 libnppisu12
  libnppitc12 libnpps12 libnvblas12 libnvidia-compute-525 libnvidia-ml-dev libnvjitlink12 libnvjpeg12 libnvrtc-builtins12.0 libnvrtc12 libnvtoolsext1 libnvvm4 libstdc++-12-dev libtbb-dev
  libtbb12 libtbbbind-2-5 libtbbmalloc2 libthrust-dev libvdpau-dev node-html5shiv nvidia-cuda-dev nvidia-cuda-gdb nvidia-cuda-toolkit nvidia-cuda-toolkit-doc nvidia-opencl-dev
  nvidia-profiler nvidia-visual-profiler ocl-icd-opencl-dev opencl-c-headers opencl-clhpp-headers openjdk-8-jre openjdk-8-jre-headless
0 upgraded, 61 newly installed, 7 to remove and 23 not upgraded.
Need to get 201 MB/1,638 MB of archives.
After this operation, 4,797 MB of additional disk space will be used.
Do you want to continue? [Y/n] ^C
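
A conflict like this can be spotted without touching the system by using apt-get's simulation mode (-s), which prints Inst/Remv lines instead of acting on them. A minimal sketch; the helper name would_remove_driver is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `apt-get -s install ...` output on stdin and
# succeeds (exit 0) if any nvidia-driver-* package would be removed.
would_remove_driver() {
    grep -Eq '^Remv nvidia-driver-[0-9]+'
}

# Usage (simulation mode installs and removes nothing):
#   apt-get -s install nvidia-cuda-toolkit | would_remove_driver \
#       && echo "this install would remove the NVIDIA driver"
```

If the helper reports a removal, answering "n" (or never running the real install) leaves the 545 driver in place.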

OK, I found that nvcc exists in /usr/local/cuda-*/bin/ but somehow it was not symlinked to /usr/local/bin.
I don’t know where this rogue nvidia-cuda-toolkit suggested by apt comes from but it’s suspicious.
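
The toolkit packages leave nvcc under /usr/local/cuda-<version>/bin, and the install guide's post-installation step is to add that directory to PATH yourself. A small sketch of doing that per shell session; use_cuda and the CUDA_ROOT override are names invented here, and exporting CUDA_HOME assumes the build uses PyTorch's extension builder, which reads that variable:

```shell
#!/usr/bin/env bash
# Hypothetical helper: point the current shell at one toolkit version.
# CUDA_ROOT is an assumed override for the install prefix (default /usr/local).
use_cuda() {
    local version="$1"
    local home="${CUDA_ROOT:-/usr/local}/cuda-${version}"
    if [ ! -d "${home}/bin" ]; then
        echo "no toolkit found at ${home}" >&2
        return 1
    fi
    export CUDA_HOME="${home}"
    export PATH="${home}/bin${PATH:+:${PATH}}"
    export LD_LIBRARY_PATH="${home}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Usage:
#   use_cuda 12.1 && nvcc --version
```

This only affects the shell it is run in, so several toolkit versions can stay installed side by side.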

I then tried to install apex and got this error:

$ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
... [snip] ...
Building wheels for collected packages: apex
  Running command Building wheel for apex (pyproject.toml)


  torch.__version__  = 2.2.1+cu121



  Compiling cuda extensions with
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2023 NVIDIA Corporation
  Built on Fri_Nov__3_17:16:49_PDT_2023
  Cuda compilation tools, release 12.3, V12.3.103
  Build cuda_12.3.r12.3/compiler.33492891_0
  from /usr/local/cuda/bin

  Traceback (most recent call last):
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
      return _build_backend().build_wheel(wheel_directory, config_settings,
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/setuptools/build_meta.py", line 434, in build_wheel
      return self._build_with_temp_dir(
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/setuptools/build_meta.py", line 419, in _build_with_temp_dir
      self.run_setup()
    File "/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in run_setup
      exec(code, locals())
    File "<string>", line 178, in <module>
    File "<string>", line 40, in check_cuda_torch_binary_vs_bare_metal
  RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 12.1.
  In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
  error: subprocess-exited-with-error
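
The same comparison can be made by hand before starting a long build: read the release number out of nvcc --version and compare it with torch.version.cuda. A rough sketch; both helper names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract "12.3" from nvcc's "release 12.3," line (stdin).
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed only if the two version strings are identical.
cuda_versions_match() {
    [ "$1" = "$2" ]
}

# Usage:
#   toolkit="$(nvcc --version | nvcc_release)"
#   torch_cuda="$(python -c 'import torch; print(torch.version.cuda)')"
#   cuda_versions_match "$toolkit" "$torch_cuda" \
#       || echo "mismatch: nvcc $toolkit vs torch $torch_cuda"
```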

I then did this to fix the above (I really wish that nvcc could have aggressively searched the system and automatically found this other cuda installation and switched to it instead of complaining about a version mismatch)

$ ls -l /etc/alternatives/ | grep cuda
lrwxrwxrwx 1 root root  20 Dec 13 08:36 cuda -> /usr/local/cuda-12.3
lrwxrwxrwx 1 root root  20 Dec 13 08:36 cuda-12 -> /usr/local/cuda-12.3

$ sudo rm /etc/alternatives/cuda
$ sudo rm /etc/alternatives/cuda-12
$ sudo ln -s /usr/local/cuda-12.1/ /etc/alternatives/cuda
$ sudo ln -s /usr/local/cuda-12.1/ /etc/alternatives/cuda-12
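
Repointing a symlink is less error-prone with absolute paths and ln -sfn, which replaces an existing link in one step instead of needing a separate rm. A sketch, with repoint_link a made-up name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: make LINK point at TARGET, replacing any existing link.
# -n stops ln from descending into the directory the old link points to.
repoint_link() {
    local target="$1" link="$2"
    [ -d "$target" ] || { echo "missing target: $target" >&2; return 1; }
    ln -sfn "$target" "$link"
}

# Usage (as root):
#   repoint_link /usr/local/cuda-12.1 /etc/alternatives/cuda
#   repoint_link /usr/local/cuda-12.1 /etc/alternatives/cuda-12
```

If these links were registered through update-alternatives (an assumption about this setup, suggested only by their location), sudo update-alternatives --config cuda may be the cleaner way to switch them.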

And now I have this – any thoughts? Is there a patch to make this work on gcc-12?

$ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
... [snip] ...
  In file included from /usr/local/cuda/include/cuda_runtime.h:83,
                   from <command-line>:
  /usr/local/cuda/include/crt/host_config.h:132:2: error: #error -- unsupported GNU version! gcc versions later than 12 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
    132 | #error -- unsupported GNU version! gcc versions later than 12 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
        |  ^~~~~
  [5/15] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-req-build-m3zpdf4t/build/temp.linux-x86_64-cpython-310/csrc/update_scale_hysteresis.o.d -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dheera/miniconda3/envs/opensora/include/python3.10 -c -c /tmp/pip-req-build-m3zpdf4t/csrc/update_scale_hysteresis.cu -o /tmp/pip-req-build-m3zpdf4t/build/temp.linux-x86_64-cpython-310/csrc/update_scale_hysteresis.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++17
  FAILED: /tmp/pip-req-build-m3zpdf4t/build/temp.linux-x86_64-cpython-310/csrc/update_scale_hysteresis.o
  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-req-build-m3zpdf4t/build/temp.linux-x86_64-cpython-310/csrc/update_scale_hysteresis.o.d -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dheera/miniconda3/envs/opensora/include/python3.10 -c -c /tmp/pip-req-build-m3zpdf4t/csrc/update_scale_hysteresis.cu -o /tmp/pip-req-build-m3zpdf4t/build/temp.linux-x86_64-cpython-310/csrc/update_scale_hysteresis.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++17

Can we please, please, please train an LLM on all of this forum and create an nvidia-figure-out-how-to-install command that just installs whatever configuration you want without complaining, and keeps aggressively trying out stuff all night with forum data until it gets it installed? If we can make humanoids and self-driving cars, NVIDIA toolkits should be able to figure out how to install themselves, without complaining, from a single command e.g.

$ nvidia-figure-out-how-to-install nvidia-driver-545 cuda-12-1 apex pytorch-2.2
... [ tries all kinds of stuff for 5 hours, edits source code when there is a C compiler error, tries every driver version, spins up containers with every possible combination of compiler and headers until it gets exactly the right environment ] ...
Done!

I tried this to fix the above (this is the kind of stuff the LLM should try):

$ sudo apt install gcc-12
$ sudo apt install g++-12
$ export CC=/usr/bin/gcc-12
$ export CXX=/usr/bin/g++-12
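
Build tools conventionally read the C compiler from CC and the C++ compiler from CXX, so each needs its own export. Before pointing a build at a compiler, its major version can be checked against the limit quoted in the host_config.h error (12). A minimal sketch; gcc_major_ok is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a compiler version string (e.g. "12.3.0"
# or just "12") has a major version no greater than MAX (default 12).
gcc_major_ok() {
    local version="$1" max="${2:-12}"
    local major="${version%%.*}"
    [ "$major" -le "$max" ]
}

# Usage: select gcc-12/g++-12 for the build only if they pass the check.
#   if gcc_major_ok "$(/usr/bin/gcc-12 -dumpversion)"; then
#       export CC=/usr/bin/gcc-12    # C compiler
#       export CXX=/usr/bin/g++-12   # C++ compiler
#   fi
```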

Now I run into this and am stuck here:

$ pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

... [ snip ] ...

  /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-req-build-4gjup1ch/build/temp.linux-x86_64-cpython-310/csrc/mlp_cuda.o.d -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/TH -I/home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/dheera/miniconda3/envs/opensora/include/python3.10 -c -c /tmp/pip-req-build-4gjup1ch/csrc/mlp_cuda.cu -o /tmp/pip-req-build-4gjup1ch/build/temp.linux-x86_64-cpython-310/csrc/mlp_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=mlp_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -ccbin /usr/bin/gcc-12 -std=c++17
  /home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
  /home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                        ^
  /home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected identifier before ‘<’ token
  /home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before ‘>’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                           ^
  /home/dheera/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ‘)’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                              ^
  ninja: build stopped: subcommand failed.