Understanding code optimization resulting from the --gpu-architecture, --gpu-code and --generate-code flags

Context
I’m looking to ship compiled CUDA code that should support a wide range of NVIDIA GPU models. I see two options:

  1. generate fat binaries and thus ship a single (fat) binary
  2. generate one binary per CUDA compute capability (or maybe: per major CC) and ship those, so that users can get the binary that fits their architecture

In order to make this choice, I’ve been trying to understand the exact way in which code gets optimized for a particular architecture using the --gpu-architecture, --gpu-code and --generate-code flags. Note: this code is meant for execution on HPC systems, so performance is a very important factor; even a few percent difference could push me one way or the other.

Of course, I have read the documentation at NVIDIA CUDA Compiler Driver and understand the two-stage compilation process. The documentation is really good, but it still leaves me with quite a number of (detailed) questions. Since I think it could benefit anyone who cares about top-notch performance, I’m hoping an nvcc expert is willing to clarify these things :)

Question 1
What is unclear to me is whether the stage 1 compilation has any effect on how fast the code will run. For example, I understand that both

nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 ...

and

nvcc --gpu-architecture=compute_70 --gpu-code=sm_70 ...

create binary (SASS) code for the sm_70 GPU architecture.

Is that code going to be identical? Or is there a potential (performance) difference between these two approaches?
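A way I could imagine checking this for a given toolkit version is to compile a small test kernel both ways and compare the machine code that cuobjdump extracts (kernel.cu here is just a placeholder for some test code):

nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 -c kernel.cu -o kernel_c50.o
nvcc --gpu-architecture=compute_70 --gpu-code=sm_70 -c kernel.cu -o kernel_c70.o
cuobjdump --dump-sass kernel_c50.o
cuobjdump --dump-sass kernel_c70.o

and then compare the sm_70 sections of the two dumps. But even if they match for one kernel, I realize that would not prove they always do, hence the question.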

Question 2
Assuming the answer to the above question is that there is no performance difference: what is the use case for the --generate-code flags? I mean, I could do

nvcc --generate-code arch=compute_50,code=sm_50 \
    --generate-code arch=compute_70,code=sm_70 ...

but what’s the point if I could just do

nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70 ...

as well? The only possible reason I could see is if the code has different code paths depending on __CUDA_ARCH__ (and the --generate-code invocation would, I assume, compile the code twice, once with each value of __CUDA_ARCH__)?
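For example (a contrived sketch of what I mean by an architecture-dependent code path; __nanosleep is only available from CC 7.0 on):

__device__ void backoff(unsigned ns)
{
#if __CUDA_ARCH__ >= 700
    // available on compute capability 7.0 and higher
    __nanosleep(ns);
#else
    // crude fallback for older architectures
    for (unsigned i = 0; i < ns; ++i)
        __threadfence_block();
#endif
}

My assumption is that the two --generate-code invocations would compile this once with __CUDA_ARCH__ == 500 and once with __CUDA_ARCH__ == 700, so that each embedded cubin gets the right branch.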

Question 3

One could also do

nvcc --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50 ...

If this gets executed on an sm_70-capable GPU, my understanding is that the SASS code for sm_70 will be JIT-compiled by the driver from the compute_50 PTX.

Will that be just as optimized as when nvcc generates the SASS code for that architecture, i.e. using nvcc --gpu-architecture=compute_50 --gpu-code=sm_50,sm_70? Or does the JIT compiler somehow do less optimization than nvcc, e.g. to keep startup latency from becoming excessive?
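As an aside, if I understand cuobjdump correctly, one can verify what actually gets embedded by such an invocation (a.out being whatever the build produces):

cuobjdump --list-elf a.out
cuobjdump --list-ptx a.out

which for the command above should list an sm_50 cubin and compute_50 PTX only, so an sm_70 GPU would indeed have to rely on the JIT compiler.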

Question 4

Section 5.4 of the documentation states:

From this it follows that the virtual architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run…

What is ‘as low as possible’ in this context? Does that only depend on the (API) functionality used in the CUDA code? For example, consider a CUDA code using Warp Matrix functions as described in the CUDA C Programming Guide. That would need CC 7.0 or higher. Does ‘as low as possible’ for this code simply mean that it needs to be compiled with --gpu-architecture=compute_70? In other words: can I just try to make the virtual architecture as low as possible, and as long as the compilation doesn’t fail, that is ok? Or could lowering the virtual architecture lead to successful compilation but reduce the amount of optimization the compiler can do?
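To make that concrete, here is a minimal sketch of a kernel using the wmma API (just an illustration, not my actual code); as far as I can tell it cannot be built with anything lower than --gpu-architecture=compute_70, which would make compute_70 ‘as low as possible’ for it:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One 16x16x16 half-precision matrix-multiply-accumulate tile,
// computed cooperatively by a single warp.
__global__ void wmma_tile(const __half *a, const __half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}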

Question 5

Related to the above questions, and just to make this abundantly clear: does the stage one compilation step do any form of optimization? Or does all the optimization for specific architectures happen in stage two?
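(In case it helps to frame this: I know I can keep the stage one output and look at it, e.g.

nvcc --gpu-architecture=compute_70 -ptx kernel.cu -o kernel.ptx

but it is hard for me to judge from the PTX alone how much optimization happened in stage one versus what ptxas does in stage two.)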

Question 6

Is there any downside to building for all possible architectures, i.e.

nvcc --generate-code arch=compute_50,code=sm_50 \
    --generate-code arch=compute_52,code=sm_52 \
    --generate-code arch=compute_53,code=sm_53 \
...
    --generate-code arch=compute_90a,code=sm_90a ...

except for longer compilation times and a larger binary (i.e. more storage space and potentially a slightly longer startup time due to the longer file read)?

Question 7

The documentation for the -arch=all flag states that it

embeds a compiled code image for all supported architectures (sm_*), and a PTX program for the highest major virtual architecture

Does that mean it is equivalent to

nvcc --generate-code arch=compute_50,code=sm_50 \
    --generate-code arch=compute_52,code=sm_52 \
    --generate-code arch=compute_53,code=sm_53 \
...
    --generate-code arch=compute_90,code=[sm_90,compute_90] \
    --generate-code arch=compute_90a,code=sm_90a ...

?
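(If I understand correctly, the set of architectures covered by -arch=all can be listed for the installed toolkit with

nvcc --list-gpu-arch
nvcc --list-gpu-code

so the expansion above could be spelled out explicitly for a given nvcc version.)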

Answers

Question 1: Not necessarily identical. There could be performance differences.

Question 3: Not necessarily just as optimized. There is no guarantee that the JIT compiler matches the offline compiler in every respect. Also, the JIT compiler varies with the driver installed.

Question 4: ‘As low as possible’ is stated so as to maximize the actual GPUs to run on, just as stated. If you don’t care about that, then choosing as low as possible is not necessarily sensible. If you compile for compute_70, for example, you have no chance of running on a Pascal GPU. If that is important to you, you should compile for compute_60. If it’s not important to you, there is no reason to compile for compute_60; there is no benefit.