On a machine with one GH200 using CUDA Toolkit 12.6.3, I can successfully query the metric properties of CTC.TriageCompute.ctc__rx_bytes.avg. The query returns the number of passes, the collection method, the metric type, and the associated hardware unit. However, when I try to profile this metric with the CUPTI sample code concurrent_profiling.cu.txt, the call to NVPW_RawMetricsConfig_AddMetrics fails with NVPA_STATUS_ERROR and config image creation fails.
Does the Grace Hopper architecture not support profiling these metrics?
Thank you for the reply. The problem persists for me even with CUDA Toolkit 12.8.1. The driver I am using is 570.124.06, which, as I understand it, was released for CUDA Toolkit 12.8.
To collect C2C metrics, you must use the CUPTI Range Profiling API. The concurrent_profiling sample uses the CUPTI Profiling API, which doesn’t support these metrics.
To verify the availability of these metrics in your setup, we recommend using the range_profiling sample. Note that the metric values might be zero, as the kernel launched in the sample does not use the C2C unit for transfers. For device level metrics (ctc, nvlink, pcie) you need to add some changes in the range_profiling.cu file, i.e. pass an empty counter availability image while setting up CUPTI profiler host APIs which is required before creating the config image. We will try to remove the need for this workaround in a future release.
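As a rough illustration of the workaround described above, here is a minimal sketch of the selection logic only. It does not call CUPTI. The helper names (isDeviceLevelMetric, selectAvailabilityImage, imagePointer) are hypothetical, the prefixes nvlrx, nvltx and pcie are my assumption for how the nvlink and pcie units appear in metric names, and treating an empty image as a null pointer is also an assumption rather than documented behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of CUPTI: returns true when the metric's
// unit prefix (the text just before "__") is a device-level unit.
// The prefix list is an assumption based on the units named above.
bool isDeviceLevelMetric(const std::string& metricName)
{
    // Names may carry a dotted group prefix, e.g.
    // "CTC.TriageCompute.ctc__rx_bytes.avg"; locate the "unit__counter" part.
    const std::size_t sep = metricName.find("__");
    if (sep == std::string::npos) return false;
    const std::size_t dot = metricName.rfind('.', sep);
    const std::size_t start = (dot == std::string::npos) ? 0 : dot + 1;
    const std::string unit = metricName.substr(start, sep - start);
    return unit == "ctc" || unit == "nvlrx" || unit == "nvltx" || unit == "pcie";
}

// Chooses the image to hand to the profiler host setup. `queried` stands for
// the bytes the sample obtains from its counter availability query.
std::vector<uint8_t> selectAvailabilityImage(const std::vector<std::string>& metrics,
                                             const std::vector<uint8_t>& queried)
{
    for (const auto& m : metrics) {
        if (isDeviceLevelMetric(m)) {
            return {};  // empty image: the workaround for device-level metrics
        }
    }
    return queried;
}

// Assumption: an empty image is passed on as a null pointer.
const uint8_t* imagePointer(const std::vector<uint8_t>& image)
{
    return image.empty() ? nullptr : image.data();
}
```

In the sample, the result of imagePointer(...) would go wherever the counter availability image is currently passed during profiler host setup, before the config image is created. Please check the exact parameter against the CUPTI headers shipped with your toolkit.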
Please be aware that the CUPTI Profiling API may be deprecated and potentially removed in future releases. Transitioning to the CUPTI Profiler Host API and Range Profiling API will ensure continued support and compatibility. For usage of these APIs, refer to the sample range_profiling.
@ssubudhi No worries, thank you for clearing that up.
I would like further clarification on the statement below:
For device level metrics (ctc, nvlink, pcie) you need to add some changes in the range_profiling.cu file, i.e. pass an empty counter availability image while setting up CUPTI profiler host APIs which is required before creating the config image.
In range_profiling.cu, line 169 gets the counter availability image. With that line in place, running ./range_profiling -e kernel -m CTC.TriageCompute.ctc__rx_bytes.avg fails with CUPTI_ERROR_UNKNOWN at the call to cuptiProfilerHostConfigAddMetrics.
Based on the statement “pass an empty counter availability image”, I commented out line 169 in range_profiling.cu. With that change, ./range_profiling -e kernel -m CTC.TriageCompute.ctc__rx_bytes.avg runs to completion, but I get nan as the value.
This does prove the CTC metrics can be profiled/added with the CUPTI Range Profiling API, but is this what you intended by “pass an empty counter availability image”?
nan values aren’t expected in this context. Could you please share the changes you made to the range_profiling.cu file? I’m able to retrieve profiling data successfully on the GH200 chip. It would also help if you could include details like your driver version and CUDA toolkit version.