By default, nsys seems to provide profiling only for the global kernel that is launched.
Let’s say I have a global kernel k1, which in turn calls some device functions k2, k3, k4. Is there an option I can give nsys so that it produces a breakdown of k1 showing how long each of k2, k3, k4 runs?
CUPTI, which nsys relies on for CUDA tracing, doesn’t provide timing information for device functions. Please check whether the clock() or clock64() functions provided by CUDA help you. Documentation for these functions is available in the CUDA Programming Guide (Programming Guide :: CUDA Toolkit Documentation).
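
For illustration, here is a minimal sketch of the clock64() approach, not a definitive implementation. The bodies of k2, k3, k4 are hypothetical placeholders, and the `cycles` buffer and the choice to record only thread 0 are assumptions made for this example. The compiler may inline or reorder the calls, so treat the cycle counts as approximate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the device functions k2, k3, k4.
// __noinline__ asks the compiler to keep them as real calls.
__device__ __noinline__ float k2(float x) { return x * 2.0f + 1.0f; }
__device__ __noinline__ float k3(float x) { return sqrtf(x); }
__device__ __noinline__ float k4(float x) { return x * x; }

// cycles[0..2] receive the clock cycles that global thread 0
// spent in k2, k3 and k4 respectively.
__global__ void k1(float *out, long long *cycles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = static_cast<float>(i);

    long long t0 = clock64();
    v = k2(v);
    long long t1 = clock64();
    v = k3(v);
    long long t2 = clock64();
    v = k4(v);
    long long t3 = clock64();

    if (i < n) out[i] = v;   // use the result so the calls are not optimized away

    if (i == 0) {
        cycles[0] = t1 - t0;
        cycles[1] = t2 - t1;
        cycles[2] = t3 - t2;
    }
}

int main()
{
    const int n = 1024;
    float *d_out = nullptr;
    long long *d_cycles = nullptr;
    long long h_cycles[3] = {0, 0, 0};

    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_cycles, 3 * sizeof(long long));

    k1<<<n / 256, 256>>>(d_out, d_cycles, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("k2: %lld cycles\n", h_cycles[0]);
    printf("k3: %lld cycles\n", h_cycles[1]);
    printf("k4: %lld cycles\n", h_cycles[2]);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The values are per-thread counter differences for one thread, not wall-clock time for the whole kernel. They include any cycles during which that thread's warp was stalled or descheduled, so they are most useful for comparing the relative cost of k2, k3, k4.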