Embedding PTX to enable execution on more modern hardware

With nvcc and CUDA C, it is possible to specify a gencode argument pair of arch=compute_50,code=compute_50.

This embeds the SM50 PTX into the fat binary, so that JIT compilation can occur at runtime for any newer GPUs (e.g. SM60/70).
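
For reference, a minimal sketch of the nvcc invocation described above. The file names `kernel.cu` and `app` are hypothetical. The `code=compute_50` pair is the one that embeds PTX; the `code=sm_50` pair is added here only for illustration, so that an SM50 device binary sits alongside the PTX.

```shell
# Hypothetical source and output names, for illustration only.
SRC=kernel.cu
OUT=app

# code=sm_50 embeds SM50 machine code; code=compute_50 embeds SM50 PTX for JIT.
GENCODE_FLAGS="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50"

# Only build if nvcc and the (hypothetical) source file are actually present.
if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    nvcc $GENCODE_FLAGS -o "$OUT" "$SRC"
    # List the PTX images embedded in the executable.
    cuobjdump --list-ptx "$OUT"
fi
```

If the `cuobjdump` listing shows no PTX entry, the binary can only run on the architectures it holds device binaries for.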

With pgfortran, the options for -Mcuda are simply cc50 etc. If I just specify cc50, my code doesn’t seem to execute on more modern hardware.

Is there a way in pgfortran to embed the PTX in the fat binary, as is possible with nvcc?

Hi ptheywood,

You can create fat binaries to support multiple target devices by using a list of devices. For example “-Mcuda=cc35,cc50,cc60,cc70”. By default “-Mcuda” will include multiple target devices depending upon the CUDA version selected. For CUDA 7.5 (default in PGI 17.10), this would be cc35 and cc50. For CUDA 8.0, we add Pascal (cc60) and with CUDA 9.0 Volta (cc70) is added.

We don’t include the PTX, though. Instead, we embed the device binaries into the host executable.


Hi Mat,

I am aware of specifying multiple cc versions to create a fat binary, and already do so.

I was just wondering whether embedding PTX is possible via PGI, so that if I build an executable using CUDA 8.0 it would be possible to execute it on Volta hardware, i.e. by embedding PTX for JIT compilation at runtime on more modern hardware (-gencode arch=compute_XX,code=compute_XX).

Using the appropriate -gencode argument with nvcc does allow this.

For one of the executables we build using pgfortran, we no longer use OpenACC, but rather link against CUDA C object files (which include PTX). The executable produced (using -cudalibs and -Mcuda=…) works correctly for the architectures in the list of -Mcuda arguments but does not work for newer architectures.
My thought now is to use nvcc to link the Fortran and CUDA C object files, which may allow the embedded PTX to link correctly.
I would need to find the correct linker arguments for this to work.


Hi Peter,

By default we compile using RDC in order to support device side linking. This precludes us from including the PTX in the final binary.

However, if you compile with “-Mcuda=nordc”, then the PTX is included, which may be what you need. The caveat is that features such as accessing device module data or calling device routines contained in external modules won’t work. If all your modules are self-contained or if you pass in your device data to module routines, then you should be fine.
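
A sketch of what that could look like on the command line. The file names are hypothetical, and combining `cc50` with `nordc` in a single `-Mcuda` list is an assumption based on the comma-separated form shown earlier in the thread, so check it against your PGI version. The `CUDA_FORCE_PTX_JIT` environment variable is a CUDA driver setting that makes the driver ignore embedded device binaries and JIT-compile the embedded PTX instead.

```shell
# Hypothetical source and output names, for illustration only.
SRC=kernels.cuf
OUT=app

# Assumption: cc50 and nordc can be combined in a single -Mcuda list.
MCUDA_FLAGS="-Mcuda=cc50,nordc"

# Only build and run if pgfortran and the (hypothetical) source file are present.
if command -v pgfortran >/dev/null 2>&1 && [ -f "$SRC" ]; then
    pgfortran $MCUDA_FLAGS -o "$OUT" "$SRC"
    # Force the driver to JIT the embedded PTX; kernels without PTX fail to load.
    CUDA_FORCE_PTX_JIT=1 "./$OUT"
fi
```

Running with `CUDA_FORCE_PTX_JIT=1` on an older GPU is one way to exercise the JIT path without access to newer hardware.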


Hi Mat,

Including nordc has enabled JIT compilation for newer hardware through PGI.