DLProf PyTorch NVTX annotations overhead

Hi, I am currently profiling an inference run of a PyTorch model on an NVIDIA GeForce RTX 3080 GPU in a Lambda desktop PC. I am using DLProf and Nsight Systems from NVIDIA. Until now I had used Nsight Systems exclusively, and noticed that I wasn’t able to trace the link from the PyTorch Python layer APIs to the actual kernels launched on the GPU. Because of this, I tried the DLProf tool, which is supposed to provide that missing link through NVTX annotations.
After installing DLProf without containers, into my regular Python virtual environment, I added the following lines to my model's Python file, as per the user guide:

import nvidia_dlprof_pytorch_nvtx

and ran the following command:

dlprof --mode=pytorch --nsys_profile_range=true python <application with args>

With these lines added, I noticed a significant increase in the runtime of the model inference itself. The image below shows the runtime (~160 ms) with the above lines added to the model:

And the image below shows the runtime (~30 ms) for the exact same application, with no change in arguments or environment and the same iteration under observation as above, but without the NVTX annotation import and initialization lines:

Is this expected, or is there something I am missing? (This is a runtime difference of roughly 5x.) The reason I also bring this up is that all the aggregated stats reported by DLProf for the first run shown above, such as the percentage GPU utilization, seem to be off and much smaller than expected, which I attribute to the added NVTX annotations and the corresponding runtime bloat.
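
For anyone who wants to reproduce the comparison without reading the numbers off the timeline, here is a minimal sketch of a timing harness. It is an illustration under assumptions, not the script used above: `time_iterations` and `slowdown` are hypothetical helper names, the workload in the demo is a CPU stand-in, and the optional `sync` argument is where a GPU synchronization call would go so that asynchronous kernel launches are included in the measurement.

```python
import statistics
import time
from typing import Callable, Optional


def time_iterations(step: Callable[[], object],
                    warmup: int = 5,
                    iters: int = 20,
                    sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock time of step() in milliseconds.

    sync, if given, is called before starting and after finishing each
    timed step so that queued asynchronous work is counted.
    """
    def run_once() -> float:
        if sync is not None:
            sync()  # drain previously queued work before starting the clock
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for work launched by step() to finish
        return (time.perf_counter() - start) * 1e3

    # Warm-up iterations are executed but their timings are discarded.
    for _ in range(warmup):
        run_once()
    return statistics.median(run_once() for _ in range(iters))


def slowdown(annotated_ms: float, baseline_ms: float) -> float:
    """Ratio of the annotated runtime to the baseline runtime."""
    return annotated_ms / baseline_ms


if __name__ == "__main__":
    # Stand-in workload; replace with one inference iteration.
    median_ms = time_iterations(lambda: sum(range(100_000)))
    print(f"median step time: {median_ms:.3f} ms")
    # Runtimes quoted in this post: ~160 ms annotated, ~30 ms baseline.
    print(f"slowdown: {slowdown(160.0, 30.0):.1f}x")
```

With PyTorch, `step` would be something like `lambda: model(inputs)` (placeholder names) and `sync` would be `torch.cuda.synchronize`. Running the script once with and once without the NVTX lines, outside of `dlprof`, should show whether the slowdown comes from the annotations themselves or from the profiler.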