NVPROF & NV_NSIGHT are much slower than adding CUPTI to the code

I am running the LULESH proxy app on multiple V100 cards. The app itself reports a runtime of 18 s. When I add CUPTI to the code just to monitor the number of instructions executed, it finishes in 60 s. With nvprof --events inst_executed it takes about 21000 s, and with nv-nsight-cu-cli --metrics inst_executed it finishes after 116000 s. Why is there such a huge difference between in-code CUPTI instrumentation and nvprof or Nsight Compute?

Hi,

For profiling, nvprof and Nsight Compute serialize all the kernels in the application, so the application can take longer to run. Which versions of nvprof and Nsight Compute are you using?

CUPTI exposes two collection modes: kernel and continuous. In kernel collection mode, kernel launches are serialized, while continuous mode retains kernel concurrency.
Refer to the Profiling Overhead section in the CUPTI documentation for more details: https://docs.nvidia.com/cupti/Cupti/r_main.html#r_overhead_profiling

Which collection mode did you use for the in-code CUPTI instrumentation? Can you please share the pseudocode for it?

NVPROF: 11.0.221 (21)
NV-NSIGHT: 2020.1.2 (Build 28820667)

CUPTI_CALL(cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS));

So I am running in continuous mode. Even so, I am only running MPI on 8 nodes, so I would expect that if the work of the 8 nodes were serialized, it might take about 8x the time. The difference between 60 s and 21000 s is far larger than that.
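
To put the reported runtimes side by side, here is a minimal sketch that computes the slowdown factors from the numbers quoted in this thread. The function names `slowdown` and `print_reported_slowdowns` are made up for illustration; only the runtimes come from the posts above.

```cpp
#include <cassert>
#include <cstdio>

// Slowdown of a profiled run relative to a reference run (both in seconds).
double slowdown(double profiled_s, double reference_s) {
    assert(reference_s > 0.0);  // a zero or negative reference makes no sense
    return profiled_s / reference_s;
}

// Print the slowdown factors for the runtimes reported in this thread.
void print_reported_slowdowns() {
    const double baseline_s = 18.0;      // LULESH without any profiling
    const double cupti_s    = 60.0;      // in-code CUPTI, continuous mode
    const double nvprof_s   = 21000.0;   // nvprof --events inst_executed
    const double nsight_s   = 116000.0;  // nv-nsight-cu-cli --metrics inst_executed

    std::printf("CUPTI vs baseline:  %.1fx\n", slowdown(cupti_s, baseline_s));
    std::printf("nvprof vs baseline: %.1fx\n", slowdown(nvprof_s, baseline_s));
    std::printf("Nsight vs baseline: %.1fx\n", slowdown(nsight_s, baseline_s));
    std::printf("nvprof vs CUPTI:    %.1fx\n", slowdown(nvprof_s, cupti_s));
}
```

Calling `print_reported_slowdowns()` from `main` shows that nvprof is 350x slower than the in-code CUPTI run, well beyond the 8x guessed from the node count.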

In the kernel collection mode, all the kernels are serialized even on the same GPU. You can check how many CUDA streams are launching the kernels on each GPU, that would give a rough idea of the order of the kernel concurrency in the application.
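
As a rough illustration of why the stream count indicates the order of kernel concurrency, the sketch below models one GPU under idealized assumptions: streams overlap perfectly when running concurrently, and serialization runs all of their work back to back. It ignores any per-kernel cost the profiler itself adds, so it is only a way to reason about the serialization part, not a prediction. All names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Idealized wall time with perfect overlap: the busiest stream dominates.
double concurrent_time(const std::vector<double>& stream_busy_s) {
    assert(!stream_busy_s.empty());
    return *std::max_element(stream_busy_s.begin(), stream_busy_s.end());
}

// Wall time when every kernel is serialized: all stream work runs back to back.
double serialized_time(const std::vector<double>& stream_busy_s) {
    return std::accumulate(stream_busy_s.begin(), stream_busy_s.end(), 0.0);
}

// Slowdown caused by serialization alone under this model.
// It can never exceed the number of streams.
double serialization_slowdown(const std::vector<double>& stream_busy_s) {
    const double concurrent_s = concurrent_time(stream_busy_s);
    assert(concurrent_s > 0.0);
    return serialized_time(stream_busy_s) / concurrent_s;
}
```

For example, four streams that are each busy for 2 s give a slowdown of 4 under this model, while streams busy for 4 s, 1 s, and 1 s give only 1.5, because the work was already dominated by one stream.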

To confirm whether the in-code CUPTI instrumentation and nvprof numbers agree, you can use kernel mode for the in-code instrumentation. Note that nvprof uses CUPTI under the hood for profiling, so I expect the numbers to be close.

So, how can I find the number of CUDA streams in the application using one of the NVIDIA tools?
And is there a way in nvprof to collect something like --events inst_executed and the kernel time trace at the same time?

You can use nvprof or the NVIDIA Visual Profiler to identify the number of CUDA streams and the order of kernel concurrency in the application. Kernel concurrency can be easily visualized in the timeline view of the Visual Profiler.

nvprof supports event profiling and time tracing at the same time. Use the option --trace gpu along with the option --events. Note that the timing information collected this way might not match a trace-only run, since application behavior changes under the profiling session.
Sample command:
$ nvprof --events inst_executed --trace gpu <application>