Determining the correct compute capability for a loaded PTX file/kernel?


My software loads a PTX kernel via the CUDA driver API.

I think there is currently a serious gap in the CUDA driver API.

It has no function to determine/query the compute capability version that a PTX kernel was compiled for?!

This means that my application is unable to set the correct compute-capability-dependent launch parameters?!

So far I have seen that the driver API only has functionality to query:
PTX version
Binary version

I doubt that these fields correspond directly to a compute capability version?!

(Perhaps binary version means compute capability version???)

A solution could be to convert a PTX version number to a compute capability version number
(for example via a PTX-version-to-compute-capability lookup table or so).

Please clarify the situation.

Thanks and bye,

One solution could be trial and error:

Launch the kernel assuming the maximum compute capability.

If the kernel fails, descend to a lower compute capability.

Repeat until the kernel launches successfully.

Perhaps what you are looking for is cuFuncGetAttribute

“compute capability version” is missing from this list:


CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK: The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.

CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES: The size in bytes of statically-allocated shared memory per block required by this function. This does not include dynamically-allocated shared memory requested by the user at runtime.

CU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES: The size in bytes of user-allocated constant memory required by this function.

CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES: The size in bytes of local memory used by each thread of this function.

CU_FUNC_ATTRIBUTE_NUM_REGS: The number of registers used by each thread of this function.

CU_FUNC_ATTRIBUTE_PTX_VERSION: The PTX virtual architecture version for which the function was compiled. This value is the major PTX version * 10 + the minor PTX version, so a PTX version 1.3 function would return the value 13. Note that this may return the undefined value of 0 for cubins compiled prior to CUDA 3.0.

CU_FUNC_ATTRIBUTE_BINARY_VERSION: The binary architecture version for which the function was compiled. This value is the major binary version * 10 + the minor binary version, so a binary version 1.3 function would return the value 13. Note that this will return a value of 10 for legacy cubins that do not have a properly-encoded binary architecture version.

CU_FUNC_ATTRIBUTE_CACHE_MODE_CA: The attribute to indicate whether the function has been compiled with the user-specified option “-Xptxas --dlcm=ca” set.

So it is unclear which compute capability to use for PTX kernels.

For now I have implemented a trial and error approach and will upload my app soon so people can try it out :) (Skybuck’s Test CUDA Memory Performance version 0.08)

This website mentions “target architecture”.

Perhaps “target architecture” is the same as the PTX version, or maybe the binary version.

This could be tested by compiling for many different architectures and then examining these two fields.

This person also seems confused; the advice he got was “trial and error”:

Hey there, let’s assume I have a code which lets the user pass threads_per_block to call the kernel. Then I want to check if the input is valid (e.g. <=512 for compute capability CC <2.0 and 1024 for CC >=2.0). Now I wonder what happens when I compile the code with nvcc -arch=sm_13 while having a graphics card in my computer with CC 2.0! What happens when a user passes threads_per_block 1024? Is this a valid input since the card I run has CC 2.0, or is this input invalid since I compiled it for CC 1.3? Or does nvcc -arch=sm_13 just mean that CC 1.3 is at least necessary, but when running on a higher CC, those higher features can also be used? Thanks!
Advice from other person:
From the nvcc manual:


The architecture specified by this option is the architecture that is assumed by the compilation chain up to the ptx stage, …

This means it specifies what PTX features (like special instructions) the compiler can use. The maximum number of threads per block is not specified by the PTX ISA, and thus this compiler parameter is not relevant to the problem you’re trying to solve.

The best way to check if threads_per_block is valid, is to just launch the kernel and see if any errors occur.

The “trial and error” approach does not seem to be working for owners of GTX 970.

Possible conclusions:

  1. CUDA is simply not backwards compatible despite its claims.
  2. CUDA kernels will have to be recompiled for future graphics cards, otherwise they won’t work.
  3. Or alternatively… these “bugs/shortcomings” have to be fixed inside the CUDA driver.

The assumption is that re-compiling the CUDA kernel will solve the problems current users of the GTX 970 are having with my app. However, there is also a slight possibility that some other launch parameter is causing the issues. So I think it’s time I try the re-compile approach and try using/distributing different PTX kernel versions… just to see whether that solves it or not.

The compute capability version number could therefore be stored in the filename; there would be multiple files. The application then needs to load the correct file, based on the device’s compute capability.

(NVIDIA’s HTML documentation seems to jump back to the start when trying to select something, or even when moving the mouse cursor to the left side of the screen. Highly annoying! Please test the documentation on IE9!!)

Installed Firefox 35.0.1 to be able to copy this from the documentation.

This table might be of some use for figuring out the meaning of PTX version field from kernel:

Table 27 shows the PTX release history.

Table 27. PTX Release History

PTX ISA Version | CUDA Release          | Supported Targets
PTX ISA 1.0     | CUDA 1.0              | sm_{10,11}
PTX ISA 1.1     | CUDA 1.1              | sm_{10,11}
PTX ISA 1.2     | CUDA 2.0              | sm_{10,11,12,13}
PTX ISA 1.3     | CUDA 2.1              | sm_{10,11,12,13}
PTX ISA 1.4     | CUDA 2.2              | sm_{10,11,12,13}
PTX ISA 1.5     | driver r190           | sm_{10,11,12,13}
PTX ISA 2.0     | CUDA 3.0, driver r195 | sm_{10,11,12,13}, sm_20
PTX ISA 2.1     | CUDA 3.1, driver r256 | sm_{10,11,12,13}, sm_20
PTX ISA 2.2     | CUDA 3.2, driver r260 | sm_{10,11,12,13}, sm_20
PTX ISA 2.3     | CUDA 4.0, driver r270 | sm_{10,11,12,13}, sm_20
PTX ISA 3.0     | CUDA 4.1, driver r285 | sm_{10,11,12,13}, sm_20, sm_30
PTX ISA 3.0     | CUDA 4.2, driver r295 | sm_{10,11,12,13}, sm_20, sm_30
PTX ISA 3.1     | CUDA 5.0, driver r302 | sm_{10,11,12,13}, sm_20, sm_{30,35}
PTX ISA 3.2     | CUDA 5.5, driver r319 | sm_{10,11,12,13}, sm_20, sm_{30,35}
PTX ISA 4.0     | CUDA 6.0, driver r331 | sm_{10,11,12,13}, sm_20, sm_{30,32,35}, sm_50
PTX ISA 4.1     | CUDA 6.5, driver r340 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52}
PTX ISA 4.2     | CUDA 7.0, driver r346 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}


(Firefox’s font looks terrible… very thin/crisp… hate it)

This little bit of documentation basically confirms that GPU application binaries are not compatible with future hardware. (So I wish other documentation about PTX would stop claiming that PTX will work on future versions; it probably will not):
6.1. GPU Generations

In order to allow for architectural evolution, NVIDIA GPUs are released in different generations. New generations introduce major improvements in functionality and/or chip architecture, while GPU models within the same generation show minor configuration differences that moderately affect functionality, performance, or both.

Binary compatibility of GPU applications is not guaranteed across different generations. For example, a CUDA application that has been compiled for a Fermi GPU will very likely not run on a next generation graphics card (and vice versa). This is because the Fermi instruction set and instruction encodings is different from Kepler, which in turn will probably be substantially different from those of the next generation GPU.
Next generation…??

Because they share the basic instruction set, binary compatibility within one GPU generation can be guaranteed under certain conditions. This is the case between two GPU versions that do not show functional differences at all (for instance when one version is a scaled down version of the other), or when one version is functionally included in the other. An example of the latter is the base Kepler version sm_30 whose functionality is a subset of all other Kepler versions: any code compiled for sm_30 will run on all other Kepler GPUs.


The above documentation seems to conflict with the goals of PTX:

1.2. Goals of PTX

PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs supporting the computation features defined by the NVIDIA Tesla architecture. High level language compilers for languages such as CUDA and C/C++ generate PTX instructions, which are optimized for and translated to native target-architecture instructions.

The goals for PTX include the following:

  • Provide a stable ISA that spans multiple GPU generations.
  • Achieve performance in compiled applications comparable to native GPU performance.
  • Provide a machine-independent ISA for C/C++ and other compilers to target.
  • Provide a code distribution ISA for application and middleware developers.
  • Provide a common source-level ISA for optimizing code generators and translators, which map PTX to specific target machines.
  • Facilitate hand-coding of libraries, performance kernels, and architecture tests.
  • Provide a scalable programming model that spans GPU sizes from a single unit to many parallel units.

I do hope for a more stable PTX in the future…