I am trying to profile my application (it only performs loads and stores) using the CUDA event API. Initially, the execution time reported by CUDA events was similar to the time reported by Nsight Systems. However, after I disabled JIT compilation and recompiled with --gpu-architecture=compute_72 --gpu-code=sm_72 (I am on a Jetson AGX Xavier), the two no longer agree: for example, I get 6 microseconds with CUDA events and 2 ms with Nsight Systems. The Nsight Systems values are the same before and after disabling JIT.
If you need further information please let me know.
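A minimal sketch of this kind of event-based timing, with an error check after the launch, is shown here. The kernel `copyKernel`, the buffer size, and the `CUDA_CHECK` macro are hypothetical placeholders and not the actual application code. An implausibly short event time can mean the launch failed, so the sketch checks `cudaGetLastError()` right after the launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical load/store-only kernel standing in for the real one.
__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *in = nullptr, *out = nullptr;
    CUDA_CHECK(cudaMalloc(&in, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&out, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(in, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    CUDA_CHECK(cudaGetLastError());          // reports a failed launch
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));  // wait until the stop event completes

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("kernel time: %.3f ms\n", ms);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(in));
    CUDA_CHECK(cudaFree(out));
    return 0;
}
```

If the launch fails, the macro prints the error string and exits instead of reporting a misleading time.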
Update: I found that the application works when I run it with sudo. The kernel executes when I launch the app with sudo, which suggests it was not executing before. It seems the driver cannot access the binary file without sudo (very weird).
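One possible way to check whether the runtime can load the kernel at all, under stated assumptions, is to query the kernel's attributes and compare them with the device's compute capability. In this sketch `copyKernel` is a hypothetical stand-in for the real kernel and device 0 is assumed. Running it with and without sudo may show whether the failure occurs before the kernel ever starts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the application's kernel.
__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    // Query the compute capability of device 0 (assumed to be the target GPU).
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("device: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);

    // Ask the runtime for the kernel's attributes. If the kernel cannot be
    // loaded for this device, this call returns an error instead.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, copyKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // Both versions are encoded as major * 10 + minor.
    std::printf("binaryVersion=%d ptxVersion=%d\n",
                attr.binaryVersion, attr.ptxVersion);
    return 0;
}
```

Compiling this with the same --gpu-architecture=compute_72 --gpu-code=sm_72 flags keeps the comparison consistent with the application build.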
Thank you for your support.