How to use CUPTI to get metrics for the Device Attributes

I am able to use ncu to collect the Launch Metrics, Source Metrics and Instructions Per Opcode Metrics (found here). However, I am unable to get the same values with CUPTI. After modifying METRIC_NAME in the sample code in /usr/local/cuda-11.8/extras/CUPTI/samples/callback_profiling/, I get this error:

FAILED: NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest(&convertMetricToEvalRequest) with error NVPA_STATUS_INVALID_ARGUMENT.

How can I get values such as memory_l2_theoretical_sectors_global or sass__inst_executed_per_opcode?
Can someone tell me how to get these values using CUPTI? Is there an NVIDIA CUPTI sample for this?

A small kernel to profile:

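#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512   // assumed value; the original post does not define it

// Despite its name, this helper multiplies x and y and stores the product in *z.
__device__ void add1_device(const int x, const int y, int *z)
{
    *z = x * y;
}

__global__ void dot(int *a, int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // temp[threadIdx.x] = a[index] * b[index];
    add1_device(a[index], b[index], &temp[threadIdx.x]);    // Comment this line and uncomment the previous one to avoid the __device__ function

    // Wait until every thread in the block has written its product
    __syncthreads();

    // Thread 0 reduces the block's partial products and adds them to the result
    if (threadIdx.x == 0)
    {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
    }
}

int main()
{
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);

    // allocate space for the variables on the device
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));

    // allocate space for the variables on the host
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(sizeof(int));

    // this is our ground truth
    int prodTest = 0;
    // generate numbers and accumulate the reference result on the host
    for (int i = 0; i < N; i++)
    {
        a[i] = rand() % 10;
        b[i] = rand() % 10;
        prodTest += a[i] * b[i];
    }

    *c = 0;

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, sizeof(int), cudaMemcpyHostToDevice);

    dot<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("GPU result: %d, CPU reference: %d\n", *c, prodTest);

    free(a);
    free(b);
    free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}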


Thanks for reaching out. I have passed this information on to the engineering team and they are investigating. I’ll let you know as soon as I have some more information.

The ncu Launch Metrics, Source Metrics and Instructions Per Opcode Metrics (those listed under the Metrics Reference section in the ncu Kernel Profiling Guide) are defined by ncu and are not directly supported by CUPTI.

The CUPTI callback_profiling sample which you are using is the correct sample to collect the supported metrics.
You can use the cupti_metric_properties sample to query the supported metrics and various metric properties. This list of metrics should match the metrics listed by ncu --query-metrics.