Visual Studio CUDA code generation selection benefits

Hi,

I was wondering if someone could explain the benefits of setting the CUDA code generation in Visual Studio to exactly match the GPU you are targeting. Will the code be more optimized and run faster, etc.?

Example: What would be the negative effects of compiling for compute_30,sm_30 if you have a graphics card with a higher compute capability?

This is for NVCC compilation type: --compile.

Thanks in advance!

The topic of how the arch specifications at compile time affect the generation of GPU code (both PTX and SASS) is covered in many places, such as here and here. I suggest you familiarize yourself with that first.

Briefly, with that knowledge:

  • compiling for compute_30,sm_30 (only) means your executable would contain SASS code only, and that code would be suitable only for running on cc 3.x devices. The code would not run, and would produce a runtime error via the CUDA runtime API, if you attempted to run it on, say, a cc 5.x GPU. You could avoid this by adding a target that specifies compute_30,compute_30, which would include the generation of PTX code; that PTX could be forward-JIT-compiled by the GPU driver to run on a GPU with a higher compute capability. The remainder of my comments will focus on that case.
  • compiling for compute_30,compute_30 means you may trigger a JIT compilation step. Probably not a big deal.
  • compiling for compute_30,compute_30 (only) means your code cannot make use of features that were introduced on newer architectures. To pick just one example, you could not use the natively supplied double-precision atomicAdd (introduced in cc 6.0).
  • compiling for compute_30,compute_30 with a JIT recompile for a newer GPU means that you are depending on the JIT compiler contained in the driver to produce device-executable code. That compiler may or may not have all the capabilities of the compiler contained in the nvcc compiler-driver for a particular GPU architecture. We would normally expect it to be approximately equivalent, but exact equivalence is not guaranteed. The code generation may be different, and therefore the performance may be different.
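As a concrete illustration of the feature point above, here is a minimal sketch (kernel and variable names are hypothetical): the double-precision overload of atomicAdd only exists for compute capability 6.0 and higher, so compiling this for an sm_30/compute_30 target fails, while an sm_60 target accepts it.

```cuda
// Minimal sketch: the double overload of atomicAdd requires a
// cc 6.0+ compile target. Building this with
//   -gencode arch=compute_30,code=sm_30
// fails (no matching atomicAdd overload below sm_60), whereas
//   -gencode arch=compute_60,code=sm_60
// compiles it.
__global__ void sumKernel(const double *in, double *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(result, in[i]);   // double overload: cc 6.0+ only
    }
}
```

On older targets you would have to supply your own compare-and-swap-based fallback instead of the native instruction.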

Hi Robert_Crovella!

Thanks for your reply!

I just have one question regarding your answer that makes the rest a bit confusing:

“compiling for compute_30,sm_30 (only) means your executable would contain SASS code only, and that code would be only suitable for running on cc3.x devices. The code would not run and would offer a runtime error via the CUDA runtime API, if you attempted to run it on, say, a cc5.x GPU”

We are currently compiling in Visual Studio with compute_30,sm_30 and CUDA 10, and have no problems deploying the application to computers that have a higher compute capability (5.2, 6.1).

I don’t know if I read this correctly, but the application will be able to run if a “suitable PTX” is part of the application. A “suitable PTX” is one which is numerically equal to or lower than the GPU architecture being targeted for running the code.

We’re probably getting hung up on the English-language description we are using, the actual fields in the VS project, and how they map to actual nvcc compile command-line settings.

Specifying -gencode arch=compute_30,code=sm_30 will produce SASS only. I don’t remember exactly how VS maps the CUDA arch setting in the project settings to the compile command line; that is probably my mistake. I think in VS, if you specify compute_30,sm_30, it may translate to something like:

-gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30

which would generate both PTX and SASS. Yes, a suitable PTX is one that is numerically equal to or lower than the compute capability of the GPU architecture you are running on.
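Putting the two spellings side by side as nvcc command lines may help; this is a sketch (file names are placeholders), and the inspection step assumes the CUDA binary utilities are on your PATH:

```shell
# SASS only: runs on cc 3.x devices, fails at runtime on newer GPUs
nvcc -gencode arch=compute_30,code=sm_30 -c kernel.cu -o kernel.obj

# SASS plus embedded PTX: the PTX can be JIT-compiled by the driver
# for any GPU with compute capability >= 3.0
nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_30,code=compute_30 \
     -c kernel.cu -o kernel.obj

# Inspect what actually ended up in the object file
cuobjdump --list-elf kernel.obj   # embedded SASS (cubin) images
cuobjdump --list-ptx kernel.obj   # embedded PTX images
```

The cuobjdump listing is a quick way to confirm whether your build really carries a PTX image, independent of how the VS project settings are worded.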


Thanks so much for your help!
Yes, you are correct; Visual Studio compiles with the following arg: -gencode=arch=compute_30,code="sm_30,compute_30"

But to summarize: for performance benefits, it is always a good idea to build for the compute capability you are targeting.

I think that is a reasonable statement. If you’re not terribly concerned about code size/bloat, then the CUDA samples provide a reasonable hint:

  • provide SASS code for every architecture that you are likely to run on, that your build toolchain supports
  • provide a single instance of PTX to provide forward compatibility for “future” architectures.
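Following that hint, a fat-binary build line in the style of the CUDA samples might look like this sketch (the particular architectures are chosen for illustration; pick the set your toolchain and deployment targets actually need):

```shell
# SASS for each architecture you expect to run on, plus one PTX
# instance (built for the highest listed arch) for forward
# compatibility with future GPUs via JIT compilation.
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     -c kernel.cu -o kernel.obj
```

Each additional -gencode target grows the binary, which is the code-size tradeoff mentioned above.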

I believe, roughly speaking, the CUDA libraries (e.g., cuBLAS) do this sort of thing.

