CUDA application segfaults in __cudaUnregisterFatBinary() with NVCC 10.2 but not with NVCC 10.0

Is there a big difference in implementation between __cudaUnregisterFatBinary() in NVCC 10.2 and in NVCC 10.0?

My program segfaults in __cudaUnregisterFatBinary() (NVCC 10.2) inside libcudart.so, but works fine with the 10.0 version. I call __cudaUnregisterFatBinary() and __cudaRegisterFatBinary() from different .cpp files (I am doing function interposition, i.e. wrapping CUDA calls), and I make sure to pass the correct argument (void **fatCubinHandle). Is there a significant difference between the two NVCC versions that might make my app fail?

The only workaround I have left is to disable this call: once inside my __cudaUnregisterFatBinary() wrapper, I return early, before calling the real cudart implementation. Could that affect GPU device memory?

The difference between the two versions is that CUDA 10.2 requires a call to __cudaRegisterFatBinaryEnd() right after __cudaRegisterFatBinary(), whereas CUDA 10.0 does not require this call.
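
For anyone wrapping these calls the same way, here is a minimal sketch of an interposer that forwards all three entry points, including __cudaRegisterFatBinaryEnd(). It is not a definitive implementation. These are undocumented internal runtime functions, so the signatures below are assumptions taken from the toolkit's internal headers and should be checked against your CUDA version. The helper next_symbol and the counter wrapper_forwarded_calls are hypothetical names of mine. It also assumes the application links libcudart.so dynamically, so that dlsym(RTLD_NEXT, ...) can find the real functions.

```cpp
// Hypothetical build line for a preload library:
//   g++ -shared -fPIC interpose.cpp -o libinterpose.so -ldl
#include <dlfcn.h>  // dlsym, RTLD_NEXT (glibc; g++ defines _GNU_SOURCE by default)

// Hypothetical counter: number of calls actually forwarded to the real runtime.
int wrapper_forwarded_calls = 0;

// Hypothetical helper: find the next definition of `name` after this object
// in the dynamic search order (libcudart.so, if it is loaded).
template <typename Fn>
static Fn next_symbol(const char *name) {
    return reinterpret_cast<Fn>(dlsym(RTLD_NEXT, name));
}

extern "C" {

// Assumed signature: void **__cudaRegisterFatBinary(void *fatCubin)
void **__cudaRegisterFatBinary(void *fatCubin) {
    using Fn = void **(*)(void *);
    static Fn real = next_symbol<Fn>("__cudaRegisterFatBinary");
    if (!real) return nullptr;  // real runtime not found; nothing to forward to
    ++wrapper_forwarded_calls;
    return real(fatCubin);
}

// Assumed signature: void __cudaRegisterFatBinaryEnd(void **fatCubinHandle)
// The symbol is looked up at run time, so on a runtime that does not export
// it the wrapper simply does nothing.
void __cudaRegisterFatBinaryEnd(void **fatCubinHandle) {
    using Fn = void (*)(void **);
    static Fn real = next_symbol<Fn>("__cudaRegisterFatBinaryEnd");
    if (!real) return;
    ++wrapper_forwarded_calls;
    real(fatCubinHandle);
}

// Assumed signature: void __cudaUnregisterFatBinary(void **fatCubinHandle)
void __cudaUnregisterFatBinary(void **fatCubinHandle) {
    using Fn = void (*)(void **);
    static Fn real = next_symbol<Fn>("__cudaUnregisterFatBinary");
    if (!real) return;
    ++wrapper_forwarded_calls;
    real(fatCubinHandle);
}

}  // extern "C"
```

Forwarding the End call through the wrapper, instead of dropping it, keeps the register/end/unregister sequence seen by the real runtime the same as in a non-interposed build.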