How can I debug and profile a CUDA shared object used by a JNI program?

I’m using JNI to implement a Java native method that is backed by a CUDA shared object (.so).
I generate the header with JNI and include it in the .cu source file.
I compile it with nvcc using the --shared option.
I link it to the Java program.
It works.
However, I can't find a way to debug the CUDA code, because the application can only be started through the Java program.
How can I set a breakpoint in the shared object that is implemented in CUDA?
Also, how can I profile the shared object?
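
For concreteness, the setup described above might look roughly like the sketch below. This is not the actual code: the class name Demo, the method addOne, the file name demo.cu, and the library name libdemo.so are all made up for illustration, and the include paths in the build comment assume a Linux JDK with JAVA_HOME set.

```cuda
// demo.cu: hypothetical JNI entry point backed by a CUDA kernel.
// Assumed build line (paths and names are illustrative):
//   nvcc -g -G --shared -Xcompiler -fPIC \
//        -I"$JAVA_HOME/include" -I"$JAVA_HOME/include/linux" \
//        -o libdemo.so demo.cu
// -g and -G add host and device debug info; -G also disables device
// optimizations, so it is meant for debugging builds, not for profiling.
#include <cstddef>
#include <jni.h>
#include <cuda_runtime.h>

// Adds 1.0f to every element of data.
__global__ void addOneKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Matches a Java declaration such as:
//   class Demo { static native void addOne(float[] values); }
extern "C" JNIEXPORT void JNICALL
Java_Demo_addOne(JNIEnv* env, jclass, jfloatArray arr) {
    jsize n = env->GetArrayLength(arr);
    if (n <= 0) return;  // nothing to do; also avoids a zero-block launch

    jfloat* host = env->GetFloatArrayElements(arr, nullptr);
    if (host == nullptr) return;  // a Java exception is already pending

    float* dev = nullptr;
    std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);
    if (cudaMalloc(&dev, bytes) == cudaSuccess) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        addOneKernel<<<(n + 255) / 256, 256>>>(dev, n);
        // This blocking copy also waits for the kernel to finish.
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }
    // Mode 0 copies the data back to the Java array and frees the buffer.
    env->ReleaseFloatArrayElements(arr, host, 0);
}
```

On the Java side, the library would be loaded with System.loadLibrary("demo") before the native method is called.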

nvprof should be able to profile it if you use either the --profile-child-processes option or the continuous profiling methodology. I suggest reading the nvprof manual.
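
As a rough sketch of how that could look in practice, the native side can also mark the region of interest with cudaProfilerStart() and cudaProfilerStop(), so that nvprof only records the CUDA work and not everything else the JVM process does. The helper name runProfiledRegion, the kernel, the Java class Demo, and the output file names below are assumptions for illustration only.

```cuda
// Hypothetical helper that a JNI entry point could call.
// Assumed invocations (class name and output pattern are illustrative):
//   nvprof --profile-child-processes -o trace.%p.nvprof java Demo
//   nvprof --profile-child-processes --profile-from-start off \
//          -o trace.%p.nvprof java Demo
// %p is replaced by the process ID, giving one output file per process.
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Multiplies every element of data by factor.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs the kernel on a device buffer inside an explicit profiling range.
// Returns 0 on success and -1 on any CUDA error.
extern "C" int runProfiledRegion(float* dev, int n) {
    if (dev == nullptr || n <= 0) return -1;

    // With --profile-from-start off, collection begins here.
    cudaProfilerStart();

    scaleKernel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();  // wait for the GPU work

    // Stops collection; calling this after synchronizing also helps make
    // sure buffered profile data is flushed before the JVM exits.
    cudaProfilerStop();

    return err == cudaSuccess ? 0 : -1;
}
```

For the breakpoint part of the question, one approach that usually works is to build the library with -g -G, start the Java program, and attach cuda-gdb to the running JVM process with the attach command. The JVM uses SIGSEGV internally, so you will probably need `handle SIGSEGV nostop noprint pass` in the debugger before continuing.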