Inconsistent kernel execution times, affected by Nsight Systems

We're currently migrating an OpenCL app to CUDA, but we're new to CUDA so we're still getting up to speed. Our app consists of a C++ read loop that repeatedly receives a large buffer of data from an external device and then executes a series of kernels to process that data. So far we've implemented the first kernel in the CUDA version, which essentially just copies the data from the incoming buffer into GPU memory. For test purposes the loop currently iterates 1000 times.
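For context, the kernel and loop are roughly as follows. This is heavily simplified; receiveBuffer, hostBuf, d_src and d_dst are placeholders rather than our real names, and the host-to-device transfer is stripped down.

// Simplified sketch of the current setup (names are placeholders).
__global__ void copyKernel(const float* src, float* dst, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];   // first kernel: just move the incoming data into a GPU-resident buffer
}

void receiveBuffer(float* dst, size_t n);   // stand-in for our wrapper around the device's read API

void readLoop(float* hostBuf, float* d_src, float* d_dst, size_t n)
{
    for (int iter = 0; iter < 1000; ++iter)   // 1000 iterations for testing
    {
        receiveBuffer(hostBuf, n);            // blocking read from the external device
        cudaMemcpy(d_src, hostBuf, n * sizeof(float), cudaMemcpyHostToDevice);
        copyKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
        // per-iteration timing/logging shown further down
    }
}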

We capture and log the kernel execution time using CUDA events (cudaEventCreate() and cudaEventRecord()), and for the first few hundred iterations this is reported as ~55us. It then jumps to 300-400us for the remaining 600 or so executions. Stranger still, if we run the app under Nsight then both Nsight and our own logging report a consistent 55us for all 1000 kernel executions.
Why the big jump in the times that we log? And why does Nsight seem to have an effect on the profiling code? We're not sure which times to trust!
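For reference, the per-iteration timing is roughly this (simplified; the events are created once and reused, and printf stands in for our logging):

cudaEvent_t start, stop;                       // created once, reused every iteration
cudaEventCreate(&start);
cudaEventCreate(&stop);

// ...inside the read loop...
cudaEventRecord(start);
copyKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed time between the two events, in milliseconds
printf("kernel time: %.1f us\n", ms * 1000.0f);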

On a different subject, does Nsight report a kernel's average execution time anywhere? So far we've only been able to find the list of all individual execution times, but an average would be more useful given how many times these kernels will be executing.

It's possible that you're hitting a queue depth limit, but that doesn't really fit all the data you have provided. You might also be getting your event start/stop records interspersed with kernel executions in an unexpected fashion, or you could be running into power or thermal throttling. I would probably need a test case to study. Windows vs. Linux might also be important.
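If you want to rule out the interspersing possibility, one cross-check is to drain the GPU first, then time a single launch with both CUDA events and a host-side timer; if those two roughly agree with each other but not with your in-loop numbers, the in-loop events probably aren't bracketing just the kernel. A sketch, with copyKernel and its arguments standing in for your own code:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float* src, float* dst, size_t n);   // your kernel

void crossCheck(const float* d_src, float* d_dst, size_t n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();                   // make sure nothing else is still in flight

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // block until the kernel and stop event have finished
    auto t1 = std::chrono::steady_clock::now();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double hostUs = std::chrono::duration<double, std::micro>(t1 - t0).count();

    // The host time includes launch and synchronization overhead, so it will read a
    // little higher, but the two should be in the same ballpark.
    printf("event: %.1f us   host: %.1f us\n", ms * 1000.0f, hostUs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

For the power/thermal angle, watching the clocks while your loop runs (for example with nvidia-smi dmon, or nvidia-smi -q -d PERFORMANCE, which lists the current throttle reasons) will show whether the GPU drops out of its boost clocks partway through the 1000 iterations.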

Nsight Systems will report average kernel execution time in the stats output:

nsys profile --stats=true ./my_app
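
If you've already captured a report without --stats=true, you can run the stats step on its own afterwards, e.g.:

nsys stats my_report.nsys-rep

(the report file name and extension depend on your Nsight Systems version). The CUDA GPU kernel summary in that output lists, per kernel name, the number of instances along with average, minimum and maximum execution times.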