CUPTI instrumentation overhead

I’m new to Nsight Systems.

I want to profile DLRM model inference with Nsight Systems on Ubuntu.
However, the profiler overhead seems to be too high on the first inference iteration because of CUPTI instrumentation, as shown below.
(The unrolled_elementwise_kernel part is the first iteration.)

My command was: nsys profile -c cudaProfilerApi -t nvtx,cuda python <OTHER_ARGS>

Why is this happening?
Can I somehow filter out the first iteration, which carries the CUPTI instrumentation overhead?
Or is there a way to reduce the overhead?

Thanks in advance

I am running into the exact same issue with the PyTorch profiler, which is built on CUPTI. I see about 4 seconds of CUpti_ActivityOverhead activities before I start seeing my runtime and kernel activities.
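
One approach that might help, though nobody in this thread has confirmed it, is to run a few warm-up iterations before the capture range opens, so that one-time setup costs are less likely to land in the iterations you actually look at. Below is a minimal sketch in Python. The helper name run_with_warmup and its parameters are made up for illustration; they are not part of Nsight Systems or PyTorch. It only fixes the ordering: warm-up steps, synchronize, start capture, profiled steps, synchronize, stop capture. Whether this removes the CUPTI overhead from the timeline in your setup is an assumption you would need to check.

```python
from typing import Callable


def run_with_warmup(
    step: Callable[[], object],
    start_capture: Callable[[], object],
    stop_capture: Callable[[], object],
    sync: Callable[[], object] = lambda: None,
    warmup_iters: int = 3,
    profiled_iters: int = 5,
) -> None:
    """Run warm-up iterations outside the capture range, then profiled ones inside it.

    All names here are hypothetical helpers for illustration only.
    """
    # Warm-up: these iterations run before the capture range is opened.
    for _ in range(warmup_iters):
        step()
    # Wait for outstanding work so warm-up kernels do not spill into the capture.
    sync()

    start_capture()
    try:
        for _ in range(profiled_iters):
            step()
        # Make sure the profiled work has finished before closing the range.
        sync()
    finally:
        # Always close the capture range, even if a step raises.
        stop_capture()


# Possible wiring with PyTorch (assumes `model` and `batch` exist in your script):
#
#   import torch
#   run_with_warmup(
#       step=lambda: model(batch),
#       start_capture=torch.cuda.profiler.start,  # calls cudaProfilerStart
#       stop_capture=torch.cuda.profiler.stop,    # calls cudaProfilerStop
#       sync=torch.cuda.synchronize,
#   )
```

With the command from the original post (-c cudaProfilerApi), collection should begin only at the start_capture call, so the warm-up iterations stay out of the report. For the PyTorch profiler, torch.profiler.schedule with its wait and warmup arguments plays a similar role, although some overhead records may still show up at the start of the trace.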