How to profile dynamic parallelism

Hi,

Is there any way to profile dynamic parallelism with nvprof, Nsight Systems, or Nsight Compute?

Here is my system information:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2020 NVIDIA Corporation
Release version 11.1.69 (21)

NVIDIA Nsight Systems version 2020.3.4.32-52657a0

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2012-2020 NVIDIA Corporation
Version 2020.2.0 (Build 28964561)

My GPU is a Tesla V100S.

I tried both of the methods below to get run-time results, but the kernels related to dynamic parallelism are missing entirely from the output in each case.

  1. nvprof

nvprof --export-profile timeline.prof -f my_app
and
nvprof my_app

  2. nsys

cat config.ini
HandleInvalidEvents=true
nsys profile -t cuda my_app

Any ideas on how to measure or profile dynamic parallelism? Thank you for your help.

Hi,

The nvprof and nsys tools do not support tracing of CUDA Dynamic Parallelism (CDP) kernels on Volta (compute capability 7.0) and later GPU architectures.

In CUDA releases prior to 11.4, these tools error out early when a CUDA module contains CDP kernels, even if those kernels are never launched. CUDA 11.4 improved this so that all host-launched kernels are traced until a CDP kernel is encountered. This is documented in the Profiler Known Issues section of the CUDA Profiler guide.

CDP kernel launch tracing has a limitation for devices with compute capability 7.0 and higher. CUPTI traces all the host launched kernels until it encounters a host launched kernel which launches child kernels. Subsequent kernels are not traced.
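As an illustration of the launch pattern described above, a minimal sketch might look like the following. The kernel names, sizes, and the file name in the build comment are made up for this example and do not come from the original poster's application. The comments about which launches get traced simply restate the documented limitation; they are not measured results.

```cuda
// Minimal CUDA Dynamic Parallelism (CDP) sketch with hypothetical kernel names.
// Assumed build line (CDP needs relocatable device code and the device runtime):
//   nvcc -arch=sm_70 -rdc=true cdp_demo.cu -o cdp_demo -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: launched from device code, not from the host.
__global__ void child_kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Parent kernel: launched from the host, and launches a child grid itself.
__global__ void parent_kernel(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        child_kernel<<<(n + 127) / 128, 128>>>(data, n);
    }
}

// Ordinary host-launched kernel with no device-side launches.
__global__ void plain_kernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Runs plain -> parent (CDP) -> plain on the default stream and returns data[0].
int run_cdp_demo() {
    const int n = 1024;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    // Host-launched kernel before any CDP parent: expected to be traced.
    plain_kernel<<<(n + 127) / 128, 128>>>(d, n);
    // Host-launched kernel that launches a child: on compute capability 7.0+,
    // the documented limitation says tracing stops when this is encountered.
    parent_kernel<<<1, 32>>>(d, n);
    // Subsequent kernel: per the documented limitation, not traced on 7.0+.
    plain_kernel<<<(n + 127) / 128, 128>>>(d, n);
    cudaDeviceSynchronize();

    int first = -1;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return first;
}

int main() {
    printf("data[0] = %d\n", run_cdp_demo());
    return 0;
}
```

Running a small program like this under nvprof or nsys is one way to check how the limitation shows up on a given GPU and toolkit combination before profiling a larger application.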

Hi,

If I switch to a Tesla P40, which is Pascal (compute capability 6.1), will it work?
The nvprof and nsys tools support tracing of dynamic parallelism (CDP) kernels on Pascal (compute capability 6.1), right?
Does this feature have any requirements for the CUDA version?

Thank you.

Yes, these tools support tracing of CDP kernels on Pascal and older GPU architectures. There are no specific requirements for the CUDA version. The CUDA version you use, i.e. 11.1, should work.
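To check which side of that boundary a given machine falls on, a small query of the device's compute capability could be used. This is only a sketch: the helper name cdp_tracing_supported is hypothetical, and the rule it encodes (below 7.0 is supported, 7.0 and above is not) is taken from the replies in this thread rather than from any API.

```cuda
// Sketch: report each GPU's compute capability and apply the rule from this thread.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper encoding the thread's rule: nvprof/nsys trace CDP kernels
// only on architectures older than Volta (compute capability major < 7).
static bool cdp_tracing_supported(int major) {
    return major < 7;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("%s: compute capability %d.%d, CDP tracing in nvprof/nsys: %s\n",
               prop.name, prop.major, prop.minor,
               cdp_tracing_supported(prop.major) ? "supported (per this thread)"
                                                 : "not supported (per this thread)");
    }
    return 0;
}
```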

Okay, I will switch to a different GPU.

May I ask whether you have any tutorials or documentation on profiling CDP with nsys and nvprof? I need some guidance on how to analyze the results.

Thank you.

The CUDA Programming Guide has a section on CUDA Dynamic Parallelism that might help you. Its Programming Guidelines subsection may be of particular interest.

Further reading:
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://developer.nvidia.com/blog/a-cuda-dynamic-parallelism-case-study-panda/

Thank you sooooooo much :)

Hi, I have the same limitation with an RTX A4500. I’m on CUDA 11.7 and using the latest Nsight Systems release (2023.3.1). Should I upgrade CUDA or anything else? Is this still a limitation today?

Sorry, upgrading the CUDA toolkit won’t help. The nsys and nvprof tools don’t support tracing of CDP kernels on Volta and later GPU architectures.
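Where profiler tracing is unavailable, one possible fallback for coarse measurements is host-side timing with CUDA events. This is not something suggested in the replies above, only a sketch under stated assumptions: the kernel and function names are hypothetical, and it relies on the CUDA programming model's guarantee that a parent grid is not considered complete until its child grids have completed, so the measured interval covers the parent together with its children. It gives no per-child breakdown.

```cuda
// Sketch: time a CDP parent kernel (including its children) with CUDA events.
// Assumed build line: nvcc -arch=sm_70 -rdc=true cdp_time.cu -o cdp_time -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel doing some per-element work.
__global__ void child_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Hypothetical parent kernel that launches one child grid.
__global__ void parent_kernel(float *x, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        child_kernel<<<(n + 255) / 256, 256>>>(x, n);
    }
}

// Returns elapsed milliseconds for one parent launch, children included.
float time_parent_ms(float *d_x, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);          // recorded on the default stream
    parent_kernel<<<1, 32>>>(d_x, n);
    cudaEventRecord(stop);           // completes only after parent and children finish
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    time_parent_ms(d_x, n);          // warm-up run to absorb one-time startup costs
    printf("parent + children: %.3f ms\n", time_parent_ms(d_x, n));

    cudaFree(d_x);
    return 0;
}
```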