We have switched to CUDA 8.0 and build our kernels with NVCC using the command line:
-arch sm_20 -ptx
We store the resulting PTX files as local resources and load them on demand from the file system through the Driver API with cuModuleLoadDataEx. The compute capability is set to the lowest value so that we can support as many systems as possible.
Before CUDA 8 everything worked fine, but now code compiled with the command line above fails with a “binary not for this GPU” exception when the PTX is loaded through the JIT compiler on a K20.
I had always assumed that PTX code is backwards compatible when built with the -arch flag, i.e. that PTX generated for a low compute capability can be JIT-compiled for any newer GPU. However, the PTX ISA version is also burnt into the PTX code, and after reading the documentation I am no longer sure that this is really true.