We are trying to test how the CUDA runtime dynamically loads the cubin that matches the current GPU architecture.
We have one CUDA-accelerated algorithm implemented in two versions under the same function signature: one with dynamic parallelism, one without.
We use conditional compilation (__CUDA_ARCH__) to generate two CUDA binaries that implement the same function with different bodies: one without dynamic parallelism (compute_20), targeting the K2000, and one with dynamic parallelism (compute_35), targeting the K2200.
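To make the setup concrete, here is a minimal sketch of what such a kernel might look like. It is not our actual code: the kernel name `algorithm`, the helper `runVariant`, and the marker values 20 and 35 are made up for illustration. Each body only writes a marker so the host can see which one ran. Note that compute_20 needs an older toolkit (CUDA 8 or earlier), and a real child kernel launch in the compute_35 branch would additionally need relocatable device code (-rdc=true) and linking against cudadevrt.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: both bodies share one signature, and each compilation
// pass (compute_20, compute_35) keeps only the branch matching its target.
__global__ void algorithm(int *variant)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    // Body for compute_35 and above. A child kernel launch would go here;
    // that also requires relocatable device code (-rdc=true) and cudadevrt.
    *variant = 35;
#else
    // Body for older targets: no device-side kernel launches.
    *variant = 20;
#endif
}

// Launches the kernel and returns the marker of the body that actually ran,
// or -1 if any CUDA call failed (for example, no usable device or image).
int runVariant()
{
    int *dVariant = nullptr;
    int hVariant = -1;
    if (cudaMalloc(&dVariant, sizeof(int)) != cudaSuccess) return -1;

    algorithm<<<1, 1>>>(dVariant);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(&hVariant, dVariant, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dVariant);
    return err == cudaSuccess ? hVariant : -1;
}

int main()
{
    printf("kernel body that ran: %d\n", runVariant());
    return 0;
}
```

Built with the same code generation settings as ours, running this on each card would show directly which of the two bodies the runtime picked.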
The code runs correctly on the K2000, which loads the non-dynamic-parallelism binary. However, on the K2200 the CUDA runtime does not load the dynamic-parallelism binary as expected; it always loads the same non-dynamic-parallelism binary.
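One way to check which image the runtime actually selected, independent of the kernel's behaviour, is to compare the device's compute capability with the versions reported by cudaFuncGetAttributes. The sketch below assumes device 0 and uses a placeholder kernel in place of the real algorithm; the helper name `queryVersions` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real algorithm.
__global__ void algorithm(int *out) { if (out) *out = 0; }

// Fills the compute capability of device 0 (as major*10 + minor) and the
// binary and PTX versions of the image the runtime selected for the kernel.
// Returns false if either CUDA query fails.
bool queryVersions(int *deviceCC, int *binaryVersion, int *ptxVersion)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, algorithm) != cudaSuccess) return false;

    *deviceCC = prop.major * 10 + prop.minor;
    *binaryVersion = attr.binaryVersion;
    *ptxVersion = attr.ptxVersion;
    return true;
}

int main()
{
    int cc = 0, bin = 0, ptx = 0;
    if (!queryVersions(&cc, &bin, &ptx)) {
        printf("CUDA query failed\n");
        return 1;
    }
    printf("device CC %d, kernel binaryVersion %d, ptxVersion %d\n", cc, bin, ptx);
    return 0;
}
```

If ptxVersion reports 20 on the K2200, the runtime picked the compute_20 image; if it reports 35, the compute_35 image was selected and the problem lies elsewhere.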
Below is the test environment we used:
Code Generation: compute_20,sm_20;compute_35,sm_35
GPUs under test: Quadro K2200 (compute capability 5.0), Quadro K2000 (compute capability 3.0)
NVCC Compilation Type: Generate hybrid object file (--compile)
The generated files seem to be coherent:
Could you give us some hints on how to make the dynamic loading work? Or have we misunderstood how it is supposed to work?
Thank you very much,