Strange difference between CUPTI results and nvprof

I tried the callback_metric sample from CUPTI 10 on Ubuntu 18.04 with a GeForce MX150, then checked its results against nvprof from CUDA 10. Sometimes the two agree, but sometimes CUPTI returns inflated, seemingly accumulated values.

For a vector add with size = 1 << 20, the CUPTI results are:


CUDA Device Number: 0
CUDA Device Name: GeForce MX150
Launching kernel: blocks 4096, thread/block 256
Pass 0
Launching kernel: blocks 4096, thread/block 256
l2_subp0_total_read_sector_queries = 131096 (65544, 65552)
l2_subp0_total_read_sector_queries (normalized) (131096 * 2) / 2 = 131096
l2_subp1_total_read_sector_queries = 131088 (65544, 65544)
l2_subp1_total_read_sector_queries (normalized) (131088 * 2) / 2 = 131088
Metric l2_read_transactions = 262184

and sometimes they are:


CUDA Device Number: 0
CUDA Device Name: GeForce MX150
Launching kernel: blocks 4096, thread/block 256
Pass 0
Launching kernel: blocks 4096, thread/block 256
l2_subp0_total_read_sector_queries = 131101 (65540, 65561)
l2_subp0_total_read_sector_queries (normalized) (131101 * 2) / 2 = 131101
l2_subp1_total_read_sector_queries = 262200 (131080, 131120)
l2_subp1_total_read_sector_queries (normalized) (262200 * 2) / 2 = 262200
Metric l2_read_transactions = 393301

but nvprof always returns:


==25701== Event result:
Invocations Event Name Min Max Avg Total
Device "GeForce MX150 (0)"
Kernel: kernel(int const *, int const , int, int)
1 l2_subp0_total_read_sector_queries 131108 131108 131108 131108
1 l2_subp1_total_read_sector_queries 131093 131093 131093 131093

==25701== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce MX150 (0)"
Kernel: kernel(int const *, int const , int, int)
1 l2_read_transactions L2 Read Transactions 262201 262201 262201
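To make the discrepancy concrete, here is a minimal C++ sketch that redoes the arithmetic from the three outputs above. It assumes the normalization is the one printed by the sample, (sum * total instances) / profiled instances, and that l2_read_transactions is simply the sum of the two subpartition events, which is what the printed numbers are consistent with. The function names are hypothetical helpers for illustration and are not part of CUPTI.

```cpp
#include <cstdint>
#include <vector>

// Sum the per-instance event values and normalize them the way the
// sample's output shows: (sum * totalInstances) / profiledInstances.
// Hypothetical helper, not a CUPTI function.
std::uint64_t normalizedEvent(const std::vector<std::uint64_t>& perInstance,
                              std::uint64_t totalInstances) {
    if (perInstance.empty()) {
        return 0;  // nothing was profiled, so avoid dividing by zero
    }
    std::uint64_t sum = 0;
    for (std::uint64_t value : perInstance) {
        sum += value;
    }
    return (sum * totalInstances) / perInstance.size();
}

// Assumption: the metric is the plain sum of the two subpartition events.
// This matches the printed results but is inferred from them.
std::uint64_t l2ReadTransactions(std::uint64_t subp0, std::uint64_t subp1) {
    return subp0 + subp1;
}
```

With the per-instance values from the first CUPTI run, `l2ReadTransactions(normalizedEvent({65544, 65552}, 2), normalizedEvent({65544, 65544}, 2))` gives 262184, close to nvprof's 131108 + 131093 = 262201. With the second run's values, (65540, 65561) and (131080, 131120), it gives 393301. The whole difference comes from l2_subp1_total_read_sector_queries, whose two instance values are each roughly double what they are in the first run.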

What's wrong with CUPTI?

We are trying to reproduce the issue here. Once we have reproduced it, we will update this thread.

FYI, nvprof also uses CUPTI internally to profile metrics.

Thanks

Hi there,

Any update on this thread?

I am observing very similar behavior. In fact, neither the CUPTI nor the nvprof results make sense.
I am running the following simple application to check whether the counters behave as I expect:

It launches 1 thread block with 1 thread. The thread writes a constant value to a volatile memory location; volatile is used to ensure that the write goes to global memory. However, when I profile the counters l2_subp0_write_sector_misses and l2_subp1_write_sector_misses using CUPTI/nvprof, I sometimes get 0 and sometimes a nonzero number of sectors.

Other device-specific configurations:

  1. Using a Titan V GPU with persistence mode disabled and clocks set to maximum speed.
  2. Calling cudaDeviceReset() and cudaDeviceSynchronize() before starting profiling and launching the kernel.

With this procedure, repeated runs give counter values that vary a lot. Any help is much appreciated.
