Unable to profile the metric CTC.TriageCompute.ctc__rx_bytes.avg on a machine with 1 * GH200

On a machine with 1 * GH200 using CUDA Toolkit 12.6.3, I can successfully query the metric properties of CTC.TriageCompute.ctc__rx_bytes.avg, including the number of passes, the collection method, the metric type, and the associated hardware unit. However, when I try to profile this metric using the CUPTI sample code concurrent_profiling.cu.txt, the call to NVPW_RawMetricsConfig_AddMetrics fails with NVPA_STATUS_ERROR and the config image creation fails.

Does the Grace Hopper architecture not support profiling these metrics?

Hi, @tburgess

Sorry for this. CUPTI didn’t support this in CUDA 12.6.
Please try the latest CUDA toolkit with the corresponding driver.

Hello @veraj,

Thank you for the reply. The problem persists for me even when using CUDA Toolkit 12.8.1. The driver I am using is 570.124.06, which, as I understand it, was released for CUDA Toolkit 12.8.

Hello @veraj,

I just wanted to follow up with my above comment.

Best wishes,

Treece

Hi, @tburgess

Can you please provide the exact command line you executed and its output?

To collect C2C metrics, you must use the CUPTI Range Profiling API. The sample concurrent_profiling utilizes the CUPTI Profiling API, which doesn’t support these metrics.

To verify that these metrics are available in your setup, we recommend using the range_profiling sample. Note that the metric values might be zero, as the kernel launched in the sample does not use the C2C unit for transfers. For device-level metrics (ctc, nvlink, pcie) you need to make a small change in the range_profiling.cu file: pass an empty counter availability image when setting up the CUPTI Profiler Host APIs, which is required before creating the config image. We will try to remove the need for this workaround in a future release.
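For reference, the workaround above can be sketched as a change to the host-API setup in range_profiling.cu. This is a sketch only: the struct and field names (CUpti_Profiler_Host_Initialize_Params, pCounterAvailabilityImage, etc.) follow the CUPTI Profiler Host API headers shipped with recent CUDA toolkits, and the code requires a GH200 plus the CUPTI SDK to actually run, so it has not been compiled here.

```cpp
// Sketch of the workaround: when initializing the CUPTI Profiler Host API
// for device-level metrics (ctc, nvlink, pcie), pass an empty (null)
// counter availability image instead of the image queried from the device.
// Names follow cupti_profiler_host.h; verify against your CUPTI version.
CUpti_Profiler_Host_Initialize_Params hostInitParams = {
    CUpti_Profiler_Host_Initialize_Params_STRUCT_SIZE};
hostInitParams.profilerType = CUPTI_PROFILER_TYPE_RANGE_PROFILER;
hostInitParams.pChipName = chipName.c_str();         // chip name queried from the device
hostInitParams.pCounterAvailabilityImage = nullptr;  // empty image: the workaround
CUPTI_API_CALL(cuptiProfilerHostInitialize(&hostInitParams));
// hostInitParams.pHostObject is then used for the subsequent
// cuptiProfilerHostConfigAddMetrics / config-image creation calls.
```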

Command: ./range_profiling -e kernel -m <c2c metric>

Please be aware that the CUPTI Profiling API may be deprecated and potentially removed in future releases. Transitioning to the CUPTI Profiler Host API and Range Profiling API will ensure continued support and compatibility. For usage of these APIs, refer to the sample range_profiling.

@ssubudhi To clarify, when you say to collect C2C metrics, do you mean CTC metrics, or do CTC metrics fall under the umbrella of C2C?

Best wishes,

Treece

Sorry for the confusion—within our team, we often use CTC and C2C interchangeably. Thanks for pointing it out!

@ssubudhi No worries, thank you for clearing that up.

I would like to get further clarification with the below statement:

For device level metrics (ctc, nvlink, pcie) you need to add some changes in the range_profiling.cu file, i.e. pass an empty counter availability image while setting up CUPTI profiler host APIs which is required before creating the config image.

In range_profiling.cu, line 169 gets the counter availability image. Currently, if I run ./range_profiling -e kernel -m CTC.TriageCompute.ctc__rx_bytes.avg, the call to cuptiProfilerHostConfigAddMetrics fails with CUPTI_ERROR_UNKNOWN.

Based on the statement “pass an empty counter availability image”, I commented out line 169 in range_profiling.cu. With this change, ./range_profiling -e kernel -m CTC.TriageCompute.ctc__rx_bytes.avg runs to completion, but I get nan as the metric value.

This does prove that the CTC metrics can be added and profiled with the CUPTI Range Profiling API, but is this what you intended by “pass an empty counter availability image”?

nan values aren’t expected in this context. Could you please share the changes you made to the range_profiling.cu file? I’m able to retrieve profiling data successfully on the GH200 chip. It would also help if you could include details like your driver version and CUDA toolkit version.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.