Compile Modified PTX code into binary

I’d like to take the PTX generated from a .cu file, modify it, and then reintroduce it into the compilation. I found these steps online:

# nvcc -g matrix.cu -rdc=true --ptx # generate ptx of device code in matrix.cu
# nvcc -g matrix.ptx -dlink # generate object file representing device code
# nvcc -g matrix.cu -rdc=true --compile # generate host code with device stubs
# nvcc -g a_dlink.o matrix.o --lib -o matrix.lib # link host with stubs to device code
# nvcc -g matrix.lib # make executable

I modify matrix.ptx before running step 2, but step 4 (the second-to-last step) appears to ignore a_dlink.o, which contains the modified kernel. My understanding from the nvcc manual is that -rdc=true (relocatable device code) should keep the device code out of the host object generated in step 3 so that it can be linked in later. However, when I remove a_dlink.o from step 4, the program behaves exactly as if it had been built from the original .cu file; the modified PTX seems to be ignored completely.

My input file matrix.cu contains both host and device code. Is that the problem? I know the “traditional” way to use PTX is through the Driver API, but I would like to use it with the runtime API if at all possible.
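
To narrow down which file the kernel in the final executable comes from, it may help to dump the device code embedded in each intermediate file and compare it with the modified matrix.ptx. This is only a sketch: it assumes cuobjdump from the CUDA toolkit is on the PATH, that the file names match the steps above, and that the last step wrote its default output a.out (an assumption; on Windows it would be a.exe). The helper name dump_device_code is made up for illustration.

```shell
# Hypothetical helper: print the PTX and SASS embedded in one file.
# Assumes cuobjdump (shipped with the CUDA toolkit) is on the PATH.
dump_device_code() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "skip: $file not found" >&2
        return 1
    fi
    if ! command -v cuobjdump >/dev/null 2>&1; then
        echo "cuobjdump not found on PATH" >&2
        return 127
    fi
    echo "== $file =="
    cuobjdump -ptx "$file"    # embedded PTX, if any
    cuobjdump -sass "$file"   # embedded machine code (SASS), if any
}

# File names taken from the steps above; a.out is an assumed default output name.
for f in matrix.o a_dlink.o matrix.lib a.out; do
    dump_device_code "$f" || true
done
```

If matrix.o from step 3 already lists the original kernel, that would suggest the host object still carries its own copy of the device code, independent of a_dlink.o.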