Hello guys,
I ran into a problem when trying to do some deep learning training. I work on a workstation running Ubuntu 16.04.5 LTS, and I believe I have successfully installed the CUDA toolkit and driver 410.73.
yangyang@WS016:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
yangyang@WS016:~$ nvidia-smi
Tue Nov 20 12:04:05 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070     Off | 00000000:42:00.0  On |                  N/A |
|  0%   44C    P0    51W / 175W |    342MiB /  7949MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
But when I run python train.py, it fails with an nvcc error:
nvcc -std=c++11 -c -o …/cc/nms/nms_kernel.cu.o …/cc/nms/nms_kernel.cu.cc -I/usr/local/cuda/include -x cu -Xcompiler -fPIC -arch=sm_75 --expt-relaxed-constexpr
nvcc fatal : Value 'sm_75' is not defined for option 'gpu-architecture'
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/yangyang/second.pytorch/second/core/non_max_suppression/nms_cpu.py", line 10, in <module>
from second.core.non_max_suppression.nms import (
ModuleNotFoundError: No module named 'second.core.non_max_suppression.nms'
I noticed in your release notes that CUDA 10.0 adds support for the Turing architecture. Does that mean my new RTX 2070 only works with CUDA 10.0? I have to use CUDA 9.0 because of some package dependencies.
By the way, everything worked perfectly with my old GTX 960. I replaced that old graphics card with the new one yesterday, and no matter how many times I uninstall and reinstall the toolkit and the driver, it doesn't work any more.
Thanks in advance for your help!
Yang
It's recommended to use CUDA 10 with Turing, but it should be possible to use CUDA 9 with Turing as well. CUDA 9 supports -arch=sm_70, which will also work on Turing as a compile switch.
It’s not possible for me to say more, because I’m not sure how the -arch=sm_75 compile switch got into your pytorch toolchain. It must be coming from a configuration step or the specific build of pytorch that you installed, but you haven’t said what you installed exactly or how you installed it. Even if you did, I’m not sure I can sort it out for you.
But if you use a version of pytorch consistent with CUDA 9, it should work on CUDA 9, even on Turing.
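For example, a minimal sketch of that idea (just an illustration, assuming PyTorch is already installed so torch.cuda.get_device_capability() is available) could look like this:

# Sketch: pick an -arch value the installed CUDA 9 toolkit understands.
# torch.cuda.get_device_capability() returns a (major, minor) tuple,
# e.g. (7, 5) on an RTX 2070.
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (7, 5):
    # CUDA 9 nvcc does not know sm_75; sm_70 binaries still run on Turing.
    arch = "sm_70"
else:
    arch = f"sm_{major}{minor}"
print(f"-arch={arch}")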
Hello Robert,
thanks for your reply first!
As for pytorch, I installed it with the conda command "conda install pytorch torchvision -c pytorch", following the instructions on their official website (Start Locally | PyTorch). It seems the latest CUDA version pytorch supports is 9.2, so I'm not sure whether CUDA 10 would work the same way.
I also tried uninstalling CUDA 9.0 and installing CUDA 9.2, but that didn't solve the problem either.
I think I have to use CUDA 9 because I need to import a package from Google that only works with CUDA 9.
Thanks again for your help!
Yang
Presumably the compile command line you indicated:
nvcc -std=c++11 -c -o …/cc/nms/nms_kernel.cu.o …/cc/nms/nms_kernel.cu.cc -I/usr/local/cuda/include -x cu -Xcompiler -fPIC -arch=sm_75 --expt-relaxed-constexpr
is coming from pytorch. However, the compile switch in that command line:
-arch=sm_75
is only supported/relevant/sensible for CUDA 10. So it seems you somehow have a pytorch install that is aware of and expecting CUDA 10. It may be that the latest pytorch installed via conda is CUDA 10 aware.
You would either need to install a version of pytorch that is not CUDA 10 aware (and instead only knows about CUDA 9.x) or else you would need to somehow configure pytorch to use
-arch=sm_70
instead of
-arch=sm_75
when it generates that pytorch-generated compile command line.
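A quick way to check what the conda-installed pytorch actually expects (just a diagnostic sketch, not something taken from your build scripts) is to print the CUDA version it was built against and the device capability it detects:

import torch

# CUDA version the installed pytorch binary was built against, e.g. "9.0" or "10.0"
print(torch.version.cuda)
# Compute capability reported for the GPU; an RTX 2070 reports (7, 5)
print(torch.cuda.get_device_capability(0))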
Hello Robert,
thanks for the information again.
Using grep I found several files; in my case one of them is called find.py, which has a function def find_cuda_device_arch(). This function reads the graphics card info and builds the arch string with arch = f"sm_{arch[0]}{arch[-1]}", so I overwrote it with arch = "sm_70".
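For reference, the overridden function now looks roughly like this (simplified from memory, so treat it as a sketch rather than the exact contents of find.py):

# Simplified sketch of the patched function in find.py (not the exact original code).
def find_cuda_device_arch():
    # The original code detected the compute capability and built the string
    # from it, which produces "sm_75" on the RTX 2070 and makes CUDA 9 nvcc fail.
    # Hard-coding sm_70 keeps the flag within what CUDA 9 understands,
    # and sm_70 binaries still run on the Turing card.
    arch = "sm_70"
    return arch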
Since then, the error has stopped appearing. I will update this thread if any other problems show up. Also, if someone has a better solution, you're welcome to share your idea!
thanks a lot~
Yang