Why is the utilization derived from kernel activity records not equal to the GPU utilization reported by NVML?

In NVML, GPU utilization is defined as follows:

typedef struct nvmlUtilization_st
{
    unsigned int gpu;    //!< Percent of time over the past second during which one or more kernels was executing on the GPU
    unsigned int memory; //!< Percent of time over the past second during which global (device) memory was being read or written
} nvmlUtilization_t;

Based on the above definition, I expected that I could derive GPU utilization from CUPTI kernel activity records by measuring kernel active time versus idle time over regular intervals.
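To make the derivation concrete, here is a minimal sketch (my own post-processing code, not a CUPTI API) that computes the busy fraction of a time window as the union of kernel execution spans taken from activity records; overlapping kernels from concurrent streams are merged so they are not double-counted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One kernel activity record reduced to its execution span (timestamps in ns). */
typedef struct { uint64_t start; uint64_t end; } span_t;

static int cmp_span(const void *a, const void *b) {
    const span_t *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Total time within [win_start, win_end) covered by at least one kernel,
 * i.e. the length of the union of the (possibly overlapping) spans. */
static uint64_t busy_time(span_t *s, size_t n,
                          uint64_t win_start, uint64_t win_end) {
    qsort(s, n, sizeof *s, cmp_span);
    uint64_t busy = 0, cur_start = 0, cur_end = 0;
    int open = 0;
    for (size_t i = 0; i < n; i++) {
        /* Clamp each span to the measurement window. */
        uint64_t a = s[i].start < win_start ? win_start : s[i].start;
        uint64_t b = s[i].end   > win_end   ? win_end   : s[i].end;
        if (a >= b) continue;                 /* span lies outside the window */
        if (!open) { cur_start = a; cur_end = b; open = 1; }
        else if (a <= cur_end) {              /* overlaps the open interval: extend */
            if (b > cur_end) cur_end = b;
        } else {                              /* gap: close interval, start a new one */
            busy += cur_end - cur_start;
            cur_start = a; cur_end = b;
        }
    }
    if (open) busy += cur_end - cur_start;
    return busy;
}
```

The derived utilization is then `100 * busy_time(...) / (win_end - win_start)`. This is the number that consistently comes out lower than what NVML reports.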

However, the GPU utilization derived from CUPTI kernel activity records is always lower than the GPU utilization reported by NVML, as shown in the following figure, where the orange bars represent CUPTI and the gray bars represent NVML…

The difference becomes more severe when multiple processes share the same GPU. For example, when two processes each train cifar10-efficientnet B0 with a batch size of 2 (the first experiment case in the above graph), CUPTI shows about 50% GPU utilization (roughly twice the single-process ratio from that case), while NVML reports 90%.

From multiple tests, I have concluded that GPU utilization derived from kernel activity records cannot reproduce the GPU utilization reported by NVML. Instead, NVML's GPU utilization appears to track the GR Engine Active metric more closely, which is fundamentally not derived from kernel execution time.
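One way the two numbers could diverge this much: if NVML counts an entire sampling period as "active" whenever at least one kernel executed at any point in it (this is my assumption about its internal accounting, not documented behavior), then many short kernels can inflate the period-based number far above the exact busy fraction. A small self-contained sketch of the two accounting schemes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Non-overlapping, sorted kernel execution spans (timestamps in ns). */
typedef struct { uint64_t start; uint64_t end; } span_t;

/* Exact busy fraction in percent: sum of span lengths over the window.
 * This is what the CUPTI-record-based derivation measures. */
static unsigned exact_util(const span_t *s, size_t n, uint64_t win) {
    uint64_t busy = 0;
    for (size_t i = 0; i < n; i++) busy += s[i].end - s[i].start;
    return (unsigned)(100 * busy / win);
}

/* Period-based utilization in percent: the fraction of fixed-size
 * sampling periods that contain any kernel activity at all
 * (the assumed NVML-style accounting). */
static unsigned period_util(const span_t *s, size_t n,
                            uint64_t win, uint64_t period) {
    uint64_t active = 0, nperiods = win / period;
    for (uint64_t p = 0; p < nperiods; p++) {
        uint64_t ps = p * period, pe = ps + period;
        for (size_t i = 0; i < n; i++)
            if (s[i].start < pe && s[i].end > ps) { active++; break; }
    }
    return (unsigned)(100 * active / nperiods);
}
```

With one 10 ns kernel in every 100 ns period of a 1000 ns window, `exact_util` yields 10% while `period_util` yields 100%, which is qualitatively the shape of the gap I observe; whether NVML actually works this way is exactly what I am asking.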

However, I am still not completely certain about this conclusion. Could you please help me verify this issue?

I confirmed that when I instead use the queued and completion timestamps enabled via cuptiActivityEnableLatencyTimestamps(1), the results are similar to the per-process utilization collected via NVML. However, it is unclear whether this truly reflects the criterion NVML uses to decide that a kernel is active.
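For reference, this is the comparison I am making. The struct below is illustrative only (it is not the CUPTI record layout); it just shows that measuring over the wider [queued, completed) span absorbs the launch and scheduling gaps between kernels, which pushes the derived number up toward NVML's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A kernel record carrying both execution and latency timestamps (ns).
 * Field names are illustrative, not the actual CUPTI activity struct. */
typedef struct {
    uint64_t queued, start, end, completed;
} record_t;

/* Busy time over non-overlapping records, using either the execution
 * span [start, end) or the wider latency span [queued, completed). */
static uint64_t covered(const record_t *r, size_t n, int use_latency) {
    uint64_t busy = 0;
    for (size_t i = 0; i < n; i++)
        busy += use_latency ? r[i].completed - r[i].queued
                            : r[i].end - r[i].start;
    return busy;
}
```

For back-to-back launches, the latency spans are nearly contiguous even when the execution spans leave gaps, so the latency-based utilization lands much closer to NVML's figure in my measurements.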