Some questions about one metric

Hi! To test the metric sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime, we wrote the code below. But when we change the order of testsize (for example, from 100/32 to 32/100), the result also changes, and a percentage over 100% appears. What does that mean? Any ideas about these results?

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void VectorSummation(int *a,int arraysize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<arraysize){
        a[idx]=a[idx]+1;
    }
}

void InitializeVectorInt(int *vector,int arraysize,int val) {
    for(int i=0;i<arraysize;++i) {
        vector[i] = val;
    }
}

void test(int arraysize) {
    int *a, *dev_a;
    int threadNum = 256;

    cudaHostAlloc((void**)&a,arraysize*sizeof(int),cudaHostAllocDefault);
    InitializeVectorInt(a,arraysize,1);
    cudaMalloc((void**)&dev_a,arraysize*sizeof(int));
    cudaMemcpy(dev_a,a,arraysize*sizeof(int),cudaMemcpyHostToDevice);
    VectorSummation<<<(arraysize+threadNum-1)/threadNum,threadNum>>>(dev_a,arraysize);

    cudaFree(a);
    cudaFree(dev_a);
}

int main() {
    int testsize;

    testsize = 100;
    test(testsize);
    testsize = 32;
    test(testsize);

    return 0;
}

There are multiple issues with your code.

Since you allocate a with cudaHostAlloc, you must free it with cudaFreeHost; see the documentation for cudaHostAlloc. I recommend that you check all API return codes, as this cudaFree call returns cudaErrorInvalidValue.

You also need to synchronize after your kernel launch, as cudaFree does not serve as a synchronization API. You can e.g. add a call to cudaDeviceSynchronize.
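For example, something along these lines inside test() (CHECK_CUDA is just an illustrative helper name, not part of the CUDA API; it relies on the stdio.h you already include):

// Minimal error-checking helper: prints the error string if a CUDA call fails.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            printf("CUDA error '%s' at %s:%d\n",                    \
                   cudaGetErrorString(err_), __FILE__, __LINE__);   \
        }                                                           \
    } while (0)

    // ...
    VectorSummation<<<(arraysize+threadNum-1)/threadNum,threadNum>>>(dev_a,arraysize);
    CHECK_CUDA(cudaGetLastError());       // reports kernel launch errors
    CHECK_CUDA(cudaDeviceSynchronize());  // waits for the kernel and reports execution errors

    CHECK_CUDA(cudaFreeHost(a));          // memory from cudaHostAlloc must be freed with cudaFreeHost
    CHECK_CUDA(cudaFree(dev_a));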

As for your question about the metric values, please paste the output you are seeing, the Nsight Compute version, and the GPU you are running this on, and we can check whether the results are within the expected thresholds. Note that if you are looking for precise, reproducible values for the number of executed instructions, you can collect e.g. the SASS-based metrics sm__sass_thread_inst_executed_pred_on.sum and sm__sass_inst_executed.sum.
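With the Nsight Compute command line, collecting both could look something like ncu --metrics sm__sass_thread_inst_executed_pred_on.sum,sm__sass_inst_executed.sum ./your_app (your_app is a placeholder, and the exact invocation may vary by Nsight Compute version).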

Thank you for your suggestions. We have made the changes in our updated code as below:

__global__ void VectorSummation(int *a,int arraysize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<arraysize){
        a[idx]=a[idx]+1;
    }
}

void InitializeVectorInt(int *vector,int arraysize,int val) {
    for(int i=0;i<arraysize;++i) {
        vector[i] = val;
    }
}

void test(int arraysize) {
    int *a, *dev_a;
    int threadNum = 256;

    cudaHostAlloc((void**)&a,arraysize*sizeof(int),cudaHostAllocDefault);
    InitializeVectorInt(a,arraysize,1);
    cudaMalloc((void**)&dev_a,arraysize*sizeof(int));
    cudaMemcpy(dev_a,a,arraysize*sizeof(int),cudaMemcpyHostToDevice);
    VectorSummation<<<(arraysize+threadNum-1)/threadNum,threadNum>>>(dev_a,arraysize);

    cudaFreeHost(a);
    cudaFree(dev_a);
    cudaDeviceSynchronize();
}

int main() {
    int testsize;

    testsize = 100;
    test(testsize);
    testsize = 32;
    test(testsize);

    return 0;
}

When we collect the metrics you mentioned, the results are as follows:

  1. sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime
-------------------------------------------------------------------------------------------------------------------
sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.pct            %                      84.21
sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.ratio                                  26.95
-------------------------------------------------------------------------------------------------------------------
sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.pct            %                     107.69
sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.ratio                                  34.46
-------------------------------------------------------------------------------------------------------------------
  2. sm__sass_thread_inst_executed_pred_on.sum
-------------------------------------------------------------------------------------------------------------------
sm__sass_thread_inst_executed_pred_on.sum                                        inst                       2136
-------------------------------------------------------------------------------------------------------------------
sm__sass_thread_inst_executed_pred_on.sum                                        inst                       1728
-------------------------------------------------------------------------------------------------------------------
  3. sm__sass_inst_executed.sum
-------------------------------------------------------------------------------------------------------------------
sm__sass_inst_executed.sum                                                       inst                         76
-------------------------------------------------------------------------------------------------------------------
sm__sass_inst_executed.sum                                                       inst                         55
-------------------------------------------------------------------------------------------------------------------

It seems that there is no difference between the original code and the updated code. Do you have any idea about these results?

_realtime metrics are less accurate than !_realtime metrics. The _realtime metrics allow the tool to collect significantly more counters per pass. When collected over a range, these metrics can be off by ± metric[!_realtime].burst_rate x #instances. If the range is extremely small (as in your example), the error in these counters can result in out-of-range values. If you increase your array size by 10-100x, you will see the values converge, but they may always be off slightly.

sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.ratio = sm__thread_inst_executed_pred_on_realtime.avg / sm__inst_executed_realtime.avg.
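Since both .avg values are averaged over the same number of SMs, that ratio equals the corresponding sums divided by each other. As a rough cross-check with the SASS counts you posted (which do not suffer from the realtime error, assuming the SASS and HW counters cover the same instructions): 2136 / 76 ≈ 28.1 for the first launch and 1728 / 55 ≈ 31.4 for the second, i.e. a pct of roughly 88% and 98% with a warp size of 32. The 107.69% / 34.46 you measured is therefore counter error, not more than 32 threads actually executing per instruction.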

smsp__thread_inst_executed_pred_on can increment by 0-32 per cycle on Volta through Ampere, or by 0-64 per cycle on Fermi through Pascal. This is an SM sub-partition counter. Counters that increment by >1 per cycle are more expensive to capture and may even require multiple passes.

Turing+ has the metric sm__thread_inst_executed_pred_on_realtime. Unlike the base counter, this is internally rolled up to the SM level. The SM has an internal counter that increments by 0-128 per cycle. However, it only sends a 1-bit pulse to the performance monitor every 128 increments. This means the counter can have an error of up to 127 per SM.

Turing+ has the metric sm__inst_executed_realtime. The SM has an internal counter that increments by 0-4 per cycle. However, it only sends a 1-bit pulse to the performance monitor every 4 increments. This means the counter can have an error of up to 3 per SM.
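To illustrate the quantization with the counts from the second launch above, assuming the internal counters start at zero: 1728 predicated-on threads on one SM produce floor(1728 / 128) = 13 pulses, so the performance monitor credits 13 x 128 = 1664 and the remaining 64 are not reported; 55 executed warp instructions produce floor(55 / 4) = 13 pulses, crediting 52 with an error of 3.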

For longer collection periods the relative error shrinks and the values should converge. For example, if you increase the array size by 10-100x, the fixed per-SM error becomes negligible compared to the total count.

The sm__sass_* metrics are collected by instrumenting the kernel assembly code and will be 100% accurate for the code that is instrumented.

Hi,
I see you are discussing the metric sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime, and I have a problem with it too. I tried the same code shared by @kyle.li and reproduced the same metric value (ratio: 34.46, which is larger than the warp size).

I want to know how it can return a value of 34.46.

Now I have tested sm__sass_thread_inst_executed_pred_on; the output is as follows.

---------------------------------------------------------------------- --------------- ------------------------------
    sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.pct            %                           100
    sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.ratio                                       32

    sm__sass_thread_inst_executed_pred_on.avg                                        inst                        37.57
    sm__sass_thread_inst_executed_pred_on.max                                        inst                         1728
    sm__sass_thread_inst_executed_pred_on.min                                        inst                            0
    sm__sass_thread_inst_executed_pred_on.sum                                        inst                         1728

    sm__thread_inst_executed_pred_on_realtime.avg                                    inst                        36.17
    sm__thread_inst_executed_pred_on_realtime.max                                    inst                         1664
    sm__thread_inst_executed_pred_on_realtime.min                                    inst                            0
    sm__thread_inst_executed_pred_on_realtime.sum                                    inst                         1664

    smsp__sass_thread_inst_executed_pred_on.max                                      inst                          576
    smsp__sass_thread_inst_executed_pred_on.min                                      inst                            0
    smsp__sass_thread_inst_executed_pred_on.sum                                      inst                         1728
    ---------------------------------------------------------------------- --------------- ------------------------------

If sm__sass_* metrics are 100% accurate, I have several questions.

  1. Does "100% accurate" stand for the real count? Why is sm__sass_thread_inst_executed_pred_on.max exactly a multiple of 128? Is that a coincidence?
  2. For metrics suffixed with _realtime, like sm__thread_inst_executed_pred_on_realtime, the counter can have an error of 127. Is it floor division or ceiling division? Does the maximum error of 127 apply per SM or to the complete GPU?
  3. Why does adding a new metric (sm__sass_thread_inst_executed_pred_on) cause sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.pct to become 100%? Can I call it a Heisenberg uncertainty problem? Why?
  4. After adding another metric (smsp__sass_thread_inst_executed_pred_on.sum), I found this warp (there is only one thread block, with one active warp) can show up on 3 SMSPs (smsp__sass_thread_inst_executed_pred_on.sum / smsp__sass_thread_inst_executed_pred_on.max == 3, 1728 / 576). Can I conclude that a warp can be scheduled across SMSPs? And across SMs? Furthermore, can a thread block be scheduled across SMs?

Thanks in advance!


Hi, Greg! Thank you for your explanation; now I have a better understanding of the question. But I still have some issues. You said "one SM only sends a 1-bit pulse to the performance monitor every 128 increments, causing an error of up to 127 per SM". I wonder whether this means the counted increments are always smaller than the actual increments. If that were true, the pct in our metric would always be smaller than 100% (and the ratio smaller than 32). However, the output does not behave that way. Do you have any idea why?

If sm__sass_* metrics are 100% accurate, I have several questions.

  1. Does "100% accurate" stand for the real count? Why is sm__sass_thread_inst_executed_pred_on.max exactly a multiple of 128? Is that a coincidence?

That is a coincidence.

  2. For metrics suffixed with _realtime, like sm__thread_inst_executed_pred_on_realtime, the counter can have an error of 127. Is it floor division or ceiling division? Does the maximum error of 127 apply per SM or to the complete GPU?

It depends on the architecture. Turing does not reset the internal counter at the start of collection, so it starts with a random value of 0-127 per SM, which can lead to overcounting. On newer architectures the internal counters are reset, which means there is an undercount. On newer architectures the internal counter is also reset to 0 on a context switch.
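As a hypothetical illustration of the Turing case: if the internal thread counter happens to start at, say, 100 out of 128, only 28 more real increments are needed before the first pulse, yet the performance monitor credits a full 128 for it. For a kernel as small as this one (on the order of 1700 predicated-on threads on a single SM), that stale initial value alone can inflate the numerator by several percent, which is how the ratio can end up above 32 and the pct above 100%.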

  3. Why does adding a new metric (sm__sass_thread_inst_executed_pred_on) cause sm__average_thread_inst_executed_pred_on_per_inst_executed_realtime.pct to become 100%? Can I call it a Heisenberg uncertainty problem? Why?

See the comment above regarding the random initial state and the times at which the internal counters are reset.

  4. After adding another metric (smsp__sass_thread_inst_executed_pred_on.sum), I found this warp (there is only one thread block, with one active warp) can show up on 3 SMSPs (smsp__sass_thread_inst_executed_pred_on.sum / smsp__sass_thread_inst_executed_pred_on.max == 3, 1728 / 576). Can I conclude that a warp can be scheduled across SMSPs? And across SMs? Furthermore, can a thread block be scheduled across SMs?

For existing GPUs, a warp is scheduled to a single SMSP of a single SM for its lifetime. If a context switch occurs, the warp is restored to the same SM and the same SMSP. The only exception is CUDA Dynamic Parallelism, which allows a thread block that is preempted and later continued to be restored to a different SM. This behavior is not guaranteed by the programming model.

The code posted in the original message always launches 8 warps, so I think you are making an incorrect assumption.

int threadNum = 256;     // 256 / 32 = 8
VectorSummation<<<(arraysize+threadNum-1)/threadNum,threadNum>>>(dev_a,arraysize);
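To spell out the arithmetic: for arraysize = 100 the grid dimension is (100 + 256 - 1) / 256 = 1, so a single block of 256 threads, i.e. 8 warps, is launched, and only 100 of those 256 threads pass the idx < arraysize check. The same holds for arraysize = 32: still 1 block and 8 warps, with only 32 predicated-on threads in the guarded code.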

Thanks for your reply, Greg! It’s awesome.

As for question 4, the metric smsp__sass_thread_inst_executed_pred_on comes back to a regular value when I set threadNum to 32.

Thank you very much.