Very long runtime when profiling with ncu

When I run my app (./myapp), which in brief does neural network training, one batch of data is processed in ~30 seconds.

When I run it as: ncu -o report ./myapp, it does not finish even the first batch in 3 hours (each kernel is processed 9 times). There was a warning that kernel replay may slow things down and that I should use application replay mode instead.
So I ran it with --replay-mode application, but it is still slow.
Are slowdowns like this normal?
Is ncu-ui faster, and should I use it to speed things up?

Based on your options, all kernels will be profiled and all metrics for the sections in the default set will be collected.
The slowdown can be due to profiling a large number of kernels and collecting a large number of metrics.
Refer to the Overhead section of the Kernel Profiling Guide in the Nsight Compute documentation.
You can try to collect data for fewer kernels by using the --launch-count option, and collect a smaller set of metrics by using the --metrics or --section options.
You can use Nsight Systems to determine which kernels to profile, or use Nsight Compute to collect a single metric such as gpu__time_duration.sum or gpc__cycles_elapsed.max, and then decide which kernels to collect detailed metrics for.
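
As a rough sketch of that two-step approach, something like the following could work. The kernel name my_hot_kernel, the report names, and the launch counts are placeholders I made up for illustration; substitute your own. The small run helper is also just a convenience of this sketch: by default it only prints each ncu command so you can check it first, and it executes the command when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Dry-run helper (not part of ncu): prints the command unless RUN=1 is set.
APP=${APP:-./myapp}
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Step 1: cheap pass. Collect only the kernel duration for a limited number
# of launches (200 is an arbitrary placeholder) to see which kernels dominate.
run ncu --metrics gpu__time_duration.sum --launch-count 200 -o timing "$APP"

# Step 2: detailed pass. Profile only the kernel of interest
# (my_hot_kernel is a hypothetical name), skip its first 10 launches,
# profile 5 launches, and collect a single section instead of the default set.
run ncu --kernel-name my_hot_kernel --launch-skip 10 --launch-count 5 \
    --section SpeedOfLight -o detail "$APP"
```

Run it as-is to see the two commands, or with RUN=1 to actually profile. Limiting both the number of profiled launches and the number of sections or metrics should reduce the number of replay passes per kernel, which is where most of the overhead comes from.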