Does Nsight make the kernel/memory time longer?

When I use cudaMemcpyToSymbol and record the time with cudaEventRecord & cudaEventElapsedTime, the measured duration is longer when running under Nsight Systems.

    // Time `times` back-to-back copies to the device symbol, then report
    // the average per-copy time.
    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    for (int i = 0; i < times; ++i) {
        cudaMemcpyToSymbol(d_global_array, h_a, nbytes, 0, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);

    float single_time = 0.0f;
    cudaEventElapsedTime(&single_time, start_event, stop_event);
    single_time = single_time / times;   // average time per cudaMemcpyToSymbol call
    std::cout << "time:" << single_time << " ms" << std::endl;

and the result is

    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ ./cudamemcpytosymbol
    time:0.0034125 ms
    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ nsys -v
    NVIDIA Nsight Systems version 2025.2.1.130-252135690618v0
    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ nsys profile -t cuda ./cudamemcpytosymbol
    WARNING: CPU IP/backtrace sampling not supported, disabling.
    Try the 'nsys status --environment' command to learn more.

    WARNING: CPU context switch tracing not supported, disabling.
    Try the 'nsys status --environment' command to learn more.

    Collecting data...
    time:0.0039857 ms
    Generating '/tmp/nsys-report-d885.qdstrm'
    [1/1] [========================100%] report1.nsys-rep
    Generated:
            /home/user/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams/report3.nsys-rep

It’s just an interesting little experiment. I wonder whether it’s more accurate to use the CPU to record timestamps, or if Nsight Systems itself has this kind of issue. How should I eliminate this difference?

Don’t worry about the warnings. Your command line requested tracing only CUDA, but because you didn’t explicitly turn off CPU sampling, nsys tried to enable it as well; the system didn’t have sufficient permissions for CPU sampling, so it skipped it and printed the warnings.
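If you want those messages to go away, you can opt out of CPU sampling and context-switch tracing explicitly on the command line. The flag spellings below are what recent nsys releases accept, so treat this as a sketch and check `nsys profile --help` for your version:

    nsys profile -t cuda -s none --cpuctxsw=none ./cudamemcpytosymbol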

Nsys uses CUPTI under the covers to trace the CUDA APIs, and that adds a little bit of overhead on a per-call basis. Here it appears to be on the order of 0.5 microseconds per cudaMemcpyToSymbol call (0.0039857 ms − 0.0034125 ms ≈ 0.00057 ms), which seems pretty reasonable to me. Note that the fact that this overhead is incurred per kernel (or per API call) is one of the reasons we encourage people not to write a lot of tiny kernels.
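If you want to see what CUPTI actually recorded (the memcpy durations in the report are taken from the GPU timeline and are largely unaffected by the CPU-side API-tracing cost), you can summarize the generated report with the stats command. The individual report names vary across nsys versions, so this is just the default invocation:

    nsys stats report1.nsys-rep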

So does that mean that when I call cudaMemcpyToSymbol repeatedly, the average time I measure will increase as the number of repetitions increases?

Imagine that you are paying a set amount for every CUDA kernel called. If your kernels mostly take 500 msec, the overhead is probably reasonable. But if you have very short kernels, say 500 usec, then the per-kernel overhead is the same, yet it accounts for a much larger fraction of your run time.

Now, the data that Nsys gives you will still be valid; however, the system is much more perturbed and further removed from how it behaves when it is not being profiled.
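On the earlier question: if the tracing cost really is a fixed amount per call, then the total time grows linearly with the repetition count while the average per-copy time stays roughly flat, just offset by a constant when nsys is attached. Here is a minimal sketch to check that empirically; the `__constant__` symbol and buffer size are hypothetical stand-ins for the ones in the original snippet:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the symbol used in the original snippet.
    __constant__ float d_global_array[256];

    int main() {
        float h_a[256] = {};                    // host source buffer
        const size_t nbytes = sizeof(h_a);
        const int counts[] = {10, 100, 1000, 10000};

        // Sweep the repetition count: with a fixed per-call tracing overhead,
        // the average per-copy time should stay roughly flat (offset by a
        // constant under nsys), while the total grows linearly.
        for (int times : counts) {
            cudaEvent_t start_event, stop_event;
            cudaEventCreate(&start_event);
            cudaEventCreate(&stop_event);

            cudaEventRecord(start_event, 0);
            for (int i = 0; i < times; ++i) {
                cudaMemcpyToSymbol(d_global_array, h_a, nbytes, 0,
                                   cudaMemcpyHostToDevice);
            }
            cudaEventRecord(stop_event, 0);
            cudaEventSynchronize(stop_event);

            float total_ms = 0.0f;
            cudaEventElapsedTime(&total_ms, start_event, stop_event);
            printf("%5d copies: total %.4f ms, avg %.6f ms/copy\n",
                   times, total_ms, total_ms / times);

            cudaEventDestroy(start_event);
            cudaEventDestroy(stop_event);
        }
        return 0;
    }

Build it with nvcc, then run the binary once directly and once under `nsys profile -t cuda` and compare the averages at each repetition count.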
