Dependency analysis in Nsight

Hi
I have an application and I want to check kernel dependencies based on the execution data. I can see the order of kernel execution using Nsight Systems, but I am wondering if it shows the data dependencies as well. Any thoughts on that?

Not directly, but I will loop in @jcohen to give you a more thorough explanation.

Hello @hwilper

Any updates? I’m also interested in that.

thanks

@jcohen – Jason?

Hi mahmood.nt and brasilino,

Nsight Systems doesn’t yet have a way to programmatically track dependencies between CUDA kernels. For example, if you use CUDA Graphs, or classic stream dependencies like cudaStreamWaitEvent, we can’t show that kernel B’s start was blocked until kernel A completed.
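To make that concrete, here is a minimal sketch of the kind of API-level dependency being described. The kernel names, sizes, and launch shapes are placeholders, not anything from Nsight itself; the point is only that kernelB cannot start before kernelA finishes, and the timeline shows both kernels but not that edge.

```cpp
// Sketch only: kernelA/kernelB and the sizes below are placeholders.
// kernelB in stream s2 cannot start until kernelA in stream s1 completes,
// because s2 waits on an event recorded after kernelA. Nsight Systems
// shows both kernels on their streams, but not this dependency edge.
#include <cuda_runtime.h>

__global__ void kernelA(float* d) { d[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float* d) { d[threadIdx.x] += 1.0f; }

int main() {
    float* d;
    cudaMalloc(&d, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    kernelA<<<1, 256, 0, s1>>>(d);     // producer in stream s1
    cudaEventRecord(done, s1);         // event fires when kernelA is done
    cudaStreamWaitEvent(s2, done, 0);  // s2 blocks until the event fires
    kernelB<<<1, 256, 0, s2>>>(d);     // consumer in stream s2

    cudaDeviceSynchronize();
    cudaFree(d);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(done);
    return 0;
}
```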

We did develop an internal tool a few years ago to do this, and also to automatically detect and highlight the critical path amongst all the dependencies. We quickly realized there’s a major difference between what I’d call “API dependencies” and “real dependencies”.

An API dependency is something the API documentation says is part of the programming model. For example, you launch two kernels or memcpys into a CUDA stream, and the second won’t start until the first completes and its memory writes are guaranteed to be visible to the second. Or, as mentioned above, you use the CUDA Graphs or CUDA Event API functions to manually create dependencies. Our tool was able to model these properly.

The problem is with “real dependencies”, where the driver stack or the hardware implementation incurs additional implicit dependencies. These aren’t covered by the API model, because those details may change from chip to chip, and we want to avoid having users’ code depend on implementation details that we want the flexibility to improve in the future without breaking everybody’s code. One example of an implicit dependency is channel limits: while you can create an unlimited number of CUDA streams, those streams must be mapped to a finite number of hardware channels, and that number may be reduced further by the driver to limit overhead (see the documentation for CUDA_DEVICE_MAX_CONNECTIONS). If you create more streams than there are available channels, some of the supposedly independent streams will alias to the same channel, creating false dependencies.

Although a number of these sources of false dependencies were easy to model, we couldn’t find a maintainable way to model all of the CUDA driver’s implementation details, and without a complete picture of every source of dependencies the tool couldn’t produce reliably correct answers. So it was never added to our shipping tools.
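As a hedged illustration of the channel-limit point, the sketch below launches independent work on more streams than a typical channel count. The kernel, stream count, and launch shape are placeholders; how streams map to hardware channels is an implementation detail that varies by driver and GPU, so the only claim here is that some API-independent launches may still serialize in the trace.

```cpp
// Illustration only: busyKernel, the stream count, and the sizes are
// placeholders. With CUDA_DEVICE_MAX_CONNECTIONS set to a small value
// (the driver defines the default and the limit), 32 streams cannot all
// get their own hardware channel, so kernels that the API treats as
// independent may still run back-to-back behind each other.
#include <cuda_runtime.h>

__global__ void busyKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < 10000; ++k) v = v * 1.0000001f + 0.5f;
        d[i] = v;
    }
}

int main() {
    const int kStreams = 32;  // more streams than available channels
    const int n = 1 << 16;

    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each launch is independent at the API level; in the trace, some of
    // them may still serialize because their streams share a channel.
    for (int s = 0; s < kStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d, n);

    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    return 0;
}
```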

All that said, we have recently discussed trying to revive this project because people regularly ask for it, and we may be able to provide something that at least achieves the more modest goal of illustrating API-level dependencies, without trying to solve the harder problem of accurate critical-path analysis. Now that more people are doing SQL exports and running scripts to analyze the trace data, this is becoming a notable gap in the feature set. I will mention this forum post to the team and see if we can get something onto the Nsight Systems roadmap.
