Can no longer create a backward-compatible CUDA binary with Titan V and CUDA 9

I have been making releases of my CUDA software using the nvcc flag below:

-arch sm_20

This produces a binary that runs on every CUDA GPU generation released so far: the sm_20 SASS covers Fermi, and the embedded compute_20 PTX is JIT-compiled for newer architectures. Unfortunately, with the arrival of CUDA 9 and the Titan V, this deployment plan stopped working.

First, CUDA 9 dropped support for sm_20, so I changed the above flag to

-arch sm_35

to maximize hardware support. However, this binary fails to run on the Titan V and gives me the following error:

GPU=1 (TITAN V) threadph=61 extra=5760 np=10000000 nthread=163840 maxgate=1 repetition=1
initializing streams ...	
MCX ERROR(-13):invalid device symbol in unit mcx_core.cu:1878

The only way I can get it to run on the Titan V is to change the flag to

-arch sm_70

I don’t understand why the sm_35-compiled binary stops working on the Titan V. Is this a bug in CUDA 9? Unfortunately, I can’t revert to CUDA 8 because it does not support Volta.

Is there a way to create a single binary that can run on all GPUs from Kepler to Volta?

thanks

-arch=sm_30 should work on Kepler to Volta when compiling with CUDA 9 (or simply omit the arch switch; sm_30 is the default in CUDA 9).

This is easy to demonstrate with an application other than the one you are currently testing, for example a CUDA sample such as vectorAdd.

It’s unclear why your application is having this issue, and a self-contained reproducer may help with investigation.
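
As a separate note: if you would rather not depend on PTX JIT compilation at run time at all, you can build a fat binary that carries prebuilt SASS for each architecture you want to support, plus PTX for anything newer. Something along these lines (just a sketch; only the arch-related flags are shown, so add the rest of your compile flags as in your Makefile):

nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     -c -o mcx_core.o mcx_core.cu

The last -gencode entry embeds compute_70 PTX so the binary can still be JIT-compiled on GPUs newer than Volta; the earlier entries avoid any JIT step on Kepler through Volta.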

My code is open source; the commands to reproduce this issue, at least on my machine, are:

git clone https://github.com/fangq/mcx.git
cd mcx/src
make
cd ../example/benchmark
./run_benchmark1.sh -n 1e7

This is the output from my Linux box (Ubuntu 14.04) with 2x Titan V; I still have the same problem even with sm_30.

fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/src$ make
nvcc -c -g -lineinfo -Xcompiler -Wall -Xcompiler -fopenmp -m64 -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=sm_30 -DMCX_TARGET_NAME='"Fermi MCX"' -DUSE_XORSHIFT128P_RAND -o mcx_core.o  mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcx_utils.o  mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcx_shapes.o  mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o tictoc.o  tictoc.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcextreme.o  mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o cjson/cJSON.o  cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++  -fopenmp -DUSE_XORSHIFT128P_RAND 
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/src$ 
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/src$ 
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/src$ cd ../example/benchmark/
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/example/benchmark$ ./run_benchmark1.sh 
###############################################################################
#                      Monte Carlo eXtreme (MCX) -- CUDA                      #
#          Copyright (c) 2009-2018 Qianqian Fang <q.fang at neu.edu>          #
#                             http://mcx.space/                               #
#                                                                             #
# Computational Optics & Translational Imaging (COTI) Lab- http://fanglab.org #
#            Department of Bioengineering, Northeastern University            #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
$Rev::db5a34 $ Last $Date::2018-07-20 19:58:31 -04$ by $Author::Qianqian Fang $
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [9000]
- compiled with: RNG [xorshift128+] with Seed Length [4]
- this version CAN save photons at the detectors


GPU=1 (TITAN V) threadph=610 extra=57600 np=100000000 nthread=163840 maxgate=1 repetition=1
initializing streams ...	
MCX ERROR(-13):invalid device symbol in unit mcx_core.cu:1878
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/example/benchmark$ ../../bin/mcx -L
=============================   GPU Infomation  ================================
Device 1 of 2:		TITAN V
Compute Capability:	7.0
Global Memory:		4062904320 B
Constant Memory:	65536 B
Shared Memory:		49152 B
Registers:		65536
Clock Speed:		1.46 GHz
Number of MPs:		80
Number of Cores:	10240
SMX count:		80
=============================   GPU Infomation  ================================
Device 2 of 2:		TITAN V
Compute Capability:	7.0
Global Memory:		4062904320 B
Constant Memory:	65536 B
Shared Memory:		49152 B
Registers:		65536
Clock Speed:		1.46 GHz
Number of MPs:		80
Number of Cores:	10240
SMX count:		80
fangq@taote:/drives/taote1/users/fangq/git/Project/github/mcx/example/benchmark$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

I also tried the vectorAdd sample you suggested, and got the error below:

fangq@taote:.../cuda-9.0/samples/0_Simple/vectorAdd$ /pub/cuda-9.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64  -arch=sm_30 -o vectorAdd.o -c vectorAdd.cu
fangq@taote:.../cuda-9.0/samples/0_Simple/vectorAdd$ /pub/cuda-9.0/bin/nvcc -ccbin g++   -m64     -arch=sm_30 -o vectorAdd vectorAdd.o 
fangq@taote:.../cuda-9.0/samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Failed to launch vectorAdd kernel (error code PTX JIT compiler library not found)!

I don’t have any trouble with the vectorAdd test on CUDA 9.2, Tesla V100:

$ nvcc -I/usr/local/cuda/samples/common/inc /usr/local/cuda/samples/0_Simple/vectorAdd/vectorAdd.cu -o vectorAdd
$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
$

Your CUDA install may be broken. If your PATH and LD_LIBRARY_PATH variables are not set correctly, you may run into trouble like this. If your PATH were set correctly, there would be no need to invoke /pub/cuda-9.0/bin/nvcc by its full path.
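
For example, something like this in your shell startup files, assuming the toolkit really is installed under /pub/cuda-9.0 as your command suggests:

export PATH=/pub/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/pub/cuda-9.0/lib64:$LD_LIBRARY_PATH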

I also didn’t seem to have any trouble with your MCX test case, although I’m running on a Tesla V100:

$ git clone https://github.com/fangq/mcx.git
Cloning into 'mcx'...
remote: Counting objects: 4229, done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 4229 (delta 55), reused 69 (delta 40), pack-reused 4136
Receiving objects: 100% (4229/4229), 10.08 MiB | 7.73 MiB/s, done.
Resolving deltas: 100% (3113/3113), done.
$ cd mcx/src
$ make
nvcc -c -g -lineinfo -Xcompiler -Wall -Xcompiler -fopenmp -m64 -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=sm_35 -DMCX_TARGET_NAME='"Fermi MCX"' -DUSE_XORSHIFT128P_RAND -o mcx_core.o  mcx_core.cu
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcx_utils.o  mcx_utils.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcx_shapes.o  mcx_shapes.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o tictoc.o  tictoc.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o mcextreme.o  mcextreme.c
cc -I/usr/local/cuda/include -g -Wall -std=c99  -fopenmp -m64 -c -o cjson/cJSON.o  cjson/cJSON.c
cc mcx_core.o mcx_utils.o mcx_shapes.o tictoc.o mcextreme.o cjson/cJSON.o -o ../bin/mcx -L/usr/local/cuda/lib64 -lcudart -lm -lstdc++  -fopenmp -DUSE_XORSHIFT128P_RAND
$ cd ../example/benchmark
$ CUDA_VISIBLE_DEVICES="0" ./run_benchmark1.sh -n 1e7
###############################################################################
#                      Monte Carlo eXtreme (MCX) -- CUDA                      #
#          Copyright (c) 2009-2018 Qianqian Fang <q.fang at neu.edu>          #
#                             http://mcx.space/                               #
#                                                                             #
# Computational Optics & Translational Imaging (COTI) Lab- http://fanglab.org #
#            Department of Bioengineering, Northeastern University            #
###############################################################################
#    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365      #
###############################################################################
$Rev::       $ Last $Date::                       $ by $Author::              $
###############################################################################
- variant name: [Fermi] compiled for GPU Capability [100] with CUDA [9020]
- compiled with: RNG [xorshift128+] with Seed Length [4]
- this version CAN save photons at the detectors

GPU=1 (Tesla V100-PCIE-32GB) threadph=61 extra=5760 np=10000000 nthread=163840 maxgate=1 repetition=1
initializing streams ...        init complete : 1 ms
requesting 2560 bytes of shared memory
lauching MCX simulation for time window [0.00e+00ns 5.00e+00ns] ...
simulation run# 1 ...
kernel complete:        221 ms
retrieving fields ...   detected 29749 photons, total: 29749    transfer complete:      231 ms
data normalization complete : 234 ms
normalizing raw data ...        normalization factor alpha=20.000000
saving data to file ... 216000 1        saving data complete : 237 ms

simulated 10000000 photons (10000000) with 163840 threads (repeat x1)
MCX simulation speed: 56179.78 photon/ms
total simulated energy: 10000000.00     absorbed: 17.68169%
(loss due to initial specular reflection is excluded in the total)
$

This was all done on CUDA 9.2.148, CentOS 7, Tesla V100-32GB

I don’t have a CUDA 9.0 machine handy to test on.

Problem solved.

I was not alone. I googled the error “PTX JIT compiler library not found” reported by the vectorAdd example and found a few links; the Stack Overflow post below was particularly useful:

https://stackoverflow.com/questions/47258882/theano-gpu-support-ptx-jit-compiler-library-not-found

I found that the libnvidia-ptxjitcompiler.so.1 symlink pointed to a non-existent older version (390.48, while the installed version is 390.77). This was possibly caused by a buggy installation script in the nvidia-390 driver package.

fangq@taote:/usr/lib/nvidia-390$ ls -lt libnvidia-ptxjitcompiler.so*
-rw-r--r-- 1 root root 10489736 Jul 11 01:28 libnvidia-ptxjitcompiler.so.390.77
lrwxrwxrwx 1 root root       34 Apr 19 16:35 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.390.48

After repointing the .so symlink to the current version, I was able to run the sm_35-compiled binary.
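
For reference, the fix was roughly the following (paths as on my machine):

cd /usr/lib/nvidia-390
sudo ln -sf libnvidia-ptxjitcompiler.so.390.77 libnvidia-ptxjitcompiler.so.1
sudo ldconfig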

Thank you very much for the help! I’m glad I can run my binary again without being forced to compile specifically for the Titan V.