I am unable to see, in the NSYS report, CUDA kernels that are launched from the device within another kernel, or CUDA streams that are created inside a CUDA kernel.
Is this a known limitation of NSYS?
Note: I haven’t shared any code here, as this is a question about feature support rather than an issue with execution.
Yes, it’s a known limitation of nsys. In the future, nsys-specific questions may be asked on the nsys forum.
Many thanks for a quick response again, Robert.
I have edited the tags on my post so that it shows up in the right forum.
Hi @Robert_Crovella,
I am using an Ampere-based GPU, so I am unable to see the kernels launched from the device.
Being unable to profile the child kernels seems quite restrictive. The product/application I am working on is MIPS-intensive and latency-bound, so having no visibility into some of the kernels makes it difficult to profile and optimize the code.
Are you aware of any other tools or methods to profile the kernels launched from the device? TIA.
We are working to add device launched kernel support. @liuyis can you confirm if this use case is in your planning?
Just to confirm: by “kernels launched from device”, are you referring to DGL (Device Graph Launch; see the NVIDIA Technical Blog post “Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch”) or CDP (CUDA Dynamic Parallelism) that Robert mentioned?
If it is DGL, then Nsys is working on adding that support for Blackwell GPUs.
If it is CDP, unfortunately it is not on the roadmap.
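For readers unsure which case applies to them, here is a minimal sketch of what a CDP launch looks like: a parent kernel creates a stream on the device and launches a child kernel into it. The names `childKernel`, `parentKernel`, and `runCdpDemo` are hypothetical and are not taken from the original poster's application. It assumes a GPU and toolkit that support dynamic parallelism and a build with relocatable device code, for example `nvcc -rdc=true cdp.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: doubles each element of the array.
__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Hypothetical parent kernel: a single thread creates a device-side stream
// and launches the child kernel into it (CUDA Dynamic Parallelism).
__global__ void parentKernel(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cudaStream_t s;
        // Streams created in device code must use cudaStreamNonBlocking.
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads, 0, s>>>(data, n);
        cudaStreamDestroy(s);
    }
}

// Host helper: runs the parent kernel and returns data[index], or -1 on error.
int runCdpDemo(int n, int index) {
    if (index < 0 || index >= n) return -1;
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) data[i] = i;

    parentKernel<<<1, 32>>>(data, n);
    // The parent grid is not complete until its child grids have finished,
    // so synchronizing on the host also waits for the child kernel.
    cudaError_t err = cudaDeviceSynchronize();
    int result = (err == cudaSuccess) ? data[index] : -1;
    cudaFree(data);
    return result;
}

int main() {
    printf("data[10] = %d\n", runCdpDemo(256, 10));
    return 0;
}
```

If your code follows this pattern, with a `<<<...>>>` launch inside a `__global__` function, it is CDP; the `childKernel` launch and the device-created stream are the kind of activity reported as missing from the report earlier in this thread.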
Thanks @hwilper and @liuyis for your responses.
I am referring to CDP, that is, kernels launched from the device, not device graphs.
Regarding DGL, when that is supported in Nsys, will it be only for Blackwell GPUs and not the older architectures? TIA
The initial support will be limited to Blackwell GPUs, as it depends on a hardware feature. We are also thinking about supporting older architectures through software patching, but there is no specific plan yet.