Which CUDA Toolkit did you use to compile the PTX?
If it was CUDA 8.0, that compiler is more aggressive about function-level dead code elimination than previous CUDA compilers. Callable programs are real functions, and if the compiler doesn't find a call to one of them in the code, it assumes the function is unused and doesn't generate code for it.
To solve that, you need to force code generation for callable programs when using CUDA 8.0. Please try adding the command line option "--relocatable-device-code=true" (note the two leading hyphens) to your nvcc options and compare the PTX results before and after.
For example, this is what my nvcc options look like in a CMakeLists.txt for an OptiX 4.0.2 project using CUDA 8.0:
NVCC_OPTIONS "--gpu-architecture=compute_30" "--use_fast_math" "--relocatable-device-code=true" "-I${OPTIX_INCLUDE_DIR}" "-I${CMAKE_CURRENT_SOURCE_DIR}/shaders"
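
If you want to check the before/after PTX on something small first, here is a minimal sketch. It does not use OptiX at all: the file name, the function name scale_value, and the output file names are made up for illustration, and the function is only a stand-in for a callable program, i.e. a device function that nothing in the translation unit calls. I'd expect it to be missing from the first PTX file and present in the second, but please verify that with your toolkit version.

```cuda
// callable_sketch.cu (hypothetical file name, not from the original project)
//
// Stand-in for an OptiX callable program: a non-inlined device function
// that no kernel in this translation unit ever calls.
extern "C" __device__ __noinline__ float scale_value(float x)
{
    return 2.0f * x;
}

// Generate PTX without and with relocatable device code
// (compute_30 matches the options above; adjust it for newer toolkits):
//
//   nvcc --ptx --gpu-architecture=compute_30 callable_sketch.cu -o without_rdc.ptx
//   nvcc --ptx --gpu-architecture=compute_30 --relocatable-device-code=true callable_sketch.cu -o with_rdc.ptx
//
// Then search both files for "scale_value" and compare.
```

If the function only shows up in with_rdc.ptx, you are seeing the same dead code elimination that removes uncalled callable programs from your OptiX PTX.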