Does Nsight make the kernel/memory time longer?

When I use cudaMemcpyToSymbol and record the time with cudaEventRecord & cudaEventElapsedTime, the measured duration is longer when running under Nsight Systems.

    // Time `times` back-to-back copies to the device symbol, then report
    // the average per-copy time.
    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    for (int i = 0; i < times; ++i) {
        cudaMemcpyToSymbol(d_global_array, h_a, nbytes, 0, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);

    float single_time = 0.0f;
    cudaEventElapsedTime(&single_time, start_event, stop_event);
    single_time = single_time / times;   // average time per cudaMemcpyToSymbol call
    std::cout << "time:" << single_time << " ms" << std::endl;

and the result is

    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ ./cudamemcpytosymbol
    time:0.0034125 ms
    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ nsys -v
    NVIDIA Nsight Systems version 2025.2.1.130-252135690618v0
    user@cdserver:~/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams$ nsys profile -t cuda ./cudamemcpytosymbol
    WARNING: CPU IP/backtrace sampling not supported, disabling.
    Try the 'nsys status --environment' command to learn more.

    WARNING: CPU context switch tracing not supported, disabling.
    Try the 'nsys status --environment' command to learn more.

    Collecting data...
    time:0.0039857 ms
    Generating '/tmp/nsys-report-d885.qdstrm'
    [1/1] [========================100%] report1.nsys-rep
    Generated:
            /home/user/nvidia/cuda-samples/Samples/0_Introduction/simpleStreams/report3.nsys-rep

It’s just an interesting little experiment. I wonder whether it’s more accurate to use the CPU to record timestamps, or if Nsight Systems itself has this kind of issue. How should I eliminate this difference?

Don’t worry about the warnings. Your command line requested tracing only CUDA, but because you didn’t explicitly turn off CPU sampling, nsys tried to enable it as well; the system didn’t have sufficient permissions for CPU sampling, so it skipped it and printed the warnings.
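If you want those messages to go away, you can opt out of CPU sampling and context-switch tracing explicitly on the command line. The flag spellings below are what recent nsys releases accept, so treat this as a sketch and check `nsys profile --help` for your version:

    nsys profile -t cuda -s none --cpuctxsw=none ./cudamemcpytosymbol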

Nsys uses CUPTI under the covers to trace the CUDA APIs, and that adds a little bit of overhead on a per-call basis. Here it appears to be on the order of 0.5 microseconds per cudaMemcpyToSymbol call (0.0039857 ms − 0.0034125 ms ≈ 0.00057 ms), which seems pretty reasonable to me. Note that the fact that this overhead is incurred per kernel (or per API call) is one of the reasons we encourage people not to write a lot of tiny kernels.
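If you want to see what CUPTI actually recorded (the memcpy durations in the report are taken from the GPU timeline and are largely unaffected by the CPU-side API-tracing cost), you can summarize the generated report with the stats command. The individual report names vary across nsys versions, so this is just the default invocation:

    nsys stats report1.nsys-rep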

So does that mean that when I call cudaMemcpyToSymbol repeatedly, the average time I measure will increase as the number of repetitions increases?

Imagine that you are paying a set amount for every CUDA kernel called. If your kernels mostly take 500 msec, the overhead is probably reasonable. But if you have very short kernels, say 500 usec, then the per-kernel overhead is the same, yet it accounts for a much larger fraction of your run time.

Now, the data that Nsys gives you will still be valid; however, the system is much more perturbed and further removed from how it behaves when it is not being profiled.
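On the earlier question: if the tracing cost really is a fixed amount per call, then the total time grows linearly with the repetition count while the average per-copy time stays roughly flat, just offset by a constant when nsys is attached. Here is a minimal sketch to check that empirically; the `__constant__` symbol and buffer size are hypothetical stand-ins for the ones in the original snippet:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for the symbol used in the original snippet.
    __constant__ float d_global_array[256];

    int main() {
        float h_a[256] = {};                    // host source buffer
        const size_t nbytes = sizeof(h_a);
        const int counts[] = {10, 100, 1000, 10000};

        // Sweep the repetition count: with a fixed per-call tracing overhead,
        // the average per-copy time should stay roughly flat (offset by a
        // constant under nsys), while the total grows linearly.
        for (int times : counts) {
            cudaEvent_t start_event, stop_event;
            cudaEventCreate(&start_event);
            cudaEventCreate(&stop_event);

            cudaEventRecord(start_event, 0);
            for (int i = 0; i < times; ++i) {
                cudaMemcpyToSymbol(d_global_array, h_a, nbytes, 0,
                                   cudaMemcpyHostToDevice);
            }
            cudaEventRecord(stop_event, 0);
            cudaEventSynchronize(stop_event);

            float total_ms = 0.0f;
            cudaEventElapsedTime(&total_ms, start_event, stop_event);
            printf("%5d copies: total %.4f ms, avg %.6f ms/copy\n",
                   times, total_ms, total_ms / times);

            cudaEventDestroy(start_event);
            cudaEventDestroy(stop_event);
        }
        return 0;
    }

Build it with nvcc, then run the binary once directly and once under `nsys profile -t cuda` and compare the averages at each repetition count.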
