Context
I’m looking to ship compiled CUDA code that should support a wide range of NVIDIA GPU models. I see two options:
- generate a fat binary and ship that single binary
- generate one binary per CUDA compute capability (or perhaps per major CC) and ship those, so that users can pick the binary that fits their architecture
In order to make this choice, I’ve been trying to understand exactly how code gets optimized for a particular architecture via the --gpu-architecture, --gpu-code and --generate-code flags. Note: this code is meant for execution on HPC systems, so performance is a very important factor - even a few percent difference could push me one way or the other.
Of course, I have read the documentation at NVIDIA CUDA Compiler Driver and understand the two-stage compilation process. The documentation is really good, but still leaves me with quite a number of detailed questions. Since I think the answers could benefit anyone who cares about top-notch performance, I’m hoping an nvcc expert is willing to clarify these things :)
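For concreteness, here is how I picture the two stages when run by hand (just a sketch; kernel.cu is a made-up example file, and the commands only run if a CUDA toolkit is installed):

```shell
# Hypothetical example kernel used in the sketch below
cat > kernel.cu <<'EOF'
__global__ void axpy(float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    # Stage 1: CUDA C++ -> PTX for a *virtual* architecture
    nvcc --gpu-architecture=compute_70 --ptx kernel.cu -o kernel.ptx
    # Stage 2: PTX -> SASS for a *real* architecture, done by ptxas
    ptxas --gpu-name=sm_70 kernel.ptx -o kernel.cubin
fi
```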
Question 1
What is unclear to me is whether the stage 1 compilation has any effect on how fast the code will run. For example, I understand that both
nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 ...
and
nvcc --gpu-architecture=compute_70 --gpu-code=sm_70 ...
create binary (SASS) code for the sm_70 GPU architecture.
Is that code going to be identical? Or is there a potential (performance) difference between these two approaches?
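(This is how I would try to check it empirically - a sketch, assuming cuobjdump from the CUDA toolkit is available; q1.cu is a made-up file:)

```shell
# Hypothetical kernel to compare the two invocations with
cat > q1.cu <<'EOF'
__global__ void scale(float *x, float s) { x[threadIdx.x] *= s; }
EOF

if command -v nvcc >/dev/null 2>&1; then
    nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 -c q1.cu -o fat.o
    nvcc --gpu-architecture=compute_70 --gpu-code=sm_70 -c q1.cu -o native.o
    cuobjdump --dump-sass fat.o    > fat.sass     # contains sm_50 and sm_70 sections
    cuobjdump --dump-sass native.o > native.sass  # sm_70 only
    # ...then compare the sm_70 section of fat.sass against native.sass (e.g. with diff)
fi
```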
Question 2
Assuming the answer to the above question is that there is no performance difference: what is the use case for the --generate-code flag? I mean, I could do
nvcc --generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_70,code=sm_70 ...
but what’s the point if I could just do
nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 ...
as well? The only reason I can see is if the code has different code paths depending on __CUDA_ARCH__ (and the --generate-code invocation would, I assume, compile the code twice, once with each value of __CUDA_ARCH__?)
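To make that concrete, here is a sketch of what I mean (q2.cu is a made-up example; the #if branch is arbitrary, and the assumption that stage 1 reruns per pair is exactly what I’m asking about):

```shell
cat > q2.cu <<'EOF'
__global__ void twice(float *x) {
#if __CUDA_ARCH__ >= 700
    // path compiled when stage 1 runs with __CUDA_ARCH__ defined as 700
    x[threadIdx.x] *= 2.0f;
#else
    // fallback path for older virtual architectures
    x[threadIdx.x] = x[threadIdx.x] + x[threadIdx.x];
#endif
}
EOF

# Each --generate-code pair would (I assume) rerun stage 1, so the preprocessor
# sees __CUDA_ARCH__=500 for the first pair and 700 for the second:
if command -v nvcc >/dev/null 2>&1; then
    nvcc --generate-code arch=compute_50,code=sm_50 \
         --generate-code arch=compute_70,code=sm_70 -c q2.cu
fi
```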
Question 3
One could also do
nvcc --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50 ...
If this gets executed on an sm_70-capable GPU, my understanding is that the SASS code for sm_70 will be JIT-compiled from the compute_50 PTX at load time. Will that code be just as optimized as when nvcc generates the SASS for that architecture ahead of time, i.e. using nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70? Or does the JIT compiler do less optimization than nvcc, e.g. to keep startup latency from becoming excessive?
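(If useful: this is how I would try to measure it - a sketch, assuming an app binary built with the flags above. CUDA_FORCE_PTX_JIT is the documented environment variable that forces the runtime to JIT from the embedded PTX.)

```shell
# Inspect what is embedded in the binary (requires the toolkit's cuobjdump)
if command -v cuobjdump >/dev/null 2>&1 && [ -f app ]; then
    cuobjdump --dump-ptx  app   # the compute_50 PTX that the JIT would consume
    cuobjdump --dump-sass app   # the precompiled sm_50 SASS
fi
# Force the runtime to ignore the embedded SASS and JIT from PTX, then compare
# kernel timings between the two runs:
#   CUDA_FORCE_PTX_JIT=1 ./app
#   ./app
```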
Question 4
Section 5.4 of the documentation states:
From this it follows that the virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run on…
What is ‘as low as possible’ in this context? Does that only depend on the (API) functionality used in the CUDA code? For example, consider a CUDA code using Warp Matrix functions as described in the CUDA C Programming Guide; those require CC 7.0 or higher. Does ‘as low as possible’ for this code simply mean that it needs to be compiled with --gpu-architecture=compute_70? In other words: can I just make the virtual architecture as low as possible, and as long as the compilation doesn’t fail, that is ok? Or could lowering the virtual architecture lead to successful compilation, but a reduced amount of optimization the compiler could do?
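Here is a sketch of the ‘compile and see if it fails’ check I have in mind: the Warp Matrix (wmma) API needs CC >= 7.0, so the same source should fail for compute_60 and compile for compute_70 (q4.cu is a made-up minimal example):

```shell
cat > q4.cu <<'EOF'
#include <mma.h>
using namespace nvcuda;

// Minimal 16x16x16 half-precision matrix-multiply using the wmma API
__global__ void wmma_16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
EOF

if command -v nvcc >/dev/null 2>&1; then
    nvcc --gpu-architecture=compute_60 -c q4.cu   # expected to fail: wmma needs CC >= 7.0
    nvcc --gpu-architecture=compute_70 -c q4.cu   # compiles
fi
```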
Question 5
Related to the above questions, and just to make this abundantly clear: does the stage 1 compilation step do any form of optimization? Or does all architecture-specific optimization happen in stage 2?
Question 6
Is there any downside to building for all possible architectures, i.e.
nvcc --generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_52,code=sm_52 \
--generate-code arch=compute_53,code=sm_53 \
...
--generate-code arch=compute_90a,code=sm_90a ...
besides longer compilation times and a larger binary (i.e. more storage space and potentially a slightly longer startup time due to reading a larger file)?
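(A sketch of how I would check the size/content side of this, assuming a built app binary and the toolkit’s cuobjdump:)

```shell
if command -v cuobjdump >/dev/null 2>&1 && [ -f app ]; then
    cuobjdump --list-elf app                      # one cubin entry per arch/code pair
    cuobjdump --dump-ptx app | grep -F '.target'  # the PTX targets embedded, if any
    ls -l app                                     # fatbin size grows with each target
fi
```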
Question 7
The documentation for the -arch=all flag states that it “embeds a compiled code image for all supported architectures (sm_*), and a PTX program for the highest major virtual architecture”. Does that mean it is equivalent to
nvcc --generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_52,code=sm_52 \
--generate-code arch=compute_53,code=sm_53 \
...
--generate-code arch=compute_90,code=[sm_90,compute_90] \
--generate-code arch=compute_90a,code=sm_90a ...
?