Strange difference between CUPTI results and nvprof

negar.vega · March 2, 2019, 11:04pm

I tried the callback_metric sample from CUPTI 10 on Ubuntu 18.04 with a GeForce MX150, then checked its results against nvprof from CUDA 10. Sometimes the two agree, but sometimes CUPTI returns inflated values that look as if counts had been accumulated more than once.

For vector add with size = 1 << 20, the CUPTI results are:

CUDA Device Number: 0
CUDA Device Name: GeForce MX150
Launching kernel: blocks 4096, thread/block 256
Pass 0
Launching kernel: blocks 4096, thread/block 256
l2_subp0_total_read_sector_queries = 131096 (65544, 65552)
l2_subp0_total_read_sector_queries (normalized) (131096 * 2) / 2 = 131096
l2_subp1_total_read_sector_queries = 131088 (65544, 65544)
l2_subp1_total_read_sector_queries (normalized) (131088 * 2) / 2 = 131088
Metric l2_read_transactions = 262184

and sometimes they are:

CUDA Device Number: 0
CUDA Device Name: GeForce MX150
Launching kernel: blocks 4096, thread/block 256
Pass 0
Launching kernel: blocks 4096, thread/block 256
l2_subp0_total_read_sector_queries = 131101 (65540, 65561)
l2_subp0_total_read_sector_queries (normalized) (131101 * 2) / 2 = 131101
l2_subp1_total_read_sector_queries = 262200 (131080, 131120)
l2_subp1_total_read_sector_queries (normalized) (262200 * 2) / 2 = 262200
Metric l2_read_transactions = 393301
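
The arithmetic behind the two outputs above can be reproduced with a small host-side sketch. It assumes, based only on the printed lines, that each event is the sum of its per-instance values scaled by (total instances / profiled instances), and that on this device l2_read_transactions is the sum of the two subpartition events. The function and parameter names are illustrative and are not taken from the CUPTI sample.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Sum the per-instance values of one event and scale the sum to the whole
// device, mirroring the "(sum * total) / profiled" lines in the output above.
// Names are hypothetical; only the arithmetic is taken from the printed log.
std::uint64_t normalizeEvent(const std::vector<std::uint64_t>& perInstance,
                             std::uint32_t totalInstances) {
    assert(!perInstance.empty());
    const std::uint64_t sum =
        std::accumulate(perInstance.begin(), perInstance.end(), std::uint64_t{0});
    return (sum * totalInstances) / perInstance.size();
}

// For the values in this post, the metric equals subp0 + subp1.
std::uint64_t l2ReadTransactions(std::uint64_t subp0, std::uint64_t subp1) {
    return subp0 + subp1;
}
```

With the first run's values, normalizeEvent({65544, 65552}, 2) gives 131096 and the metric comes to 262184. With the second run's values, normalizeEvent({131080, 131120}, 2) gives 262200 and the metric comes to 393301. The subp1 per-instance values in the second run are roughly twice those in the first, so the excess is already present in the raw counts before normalization.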

but nvprof always returns:

==25701== Event result:
Invocations  Event Name                          Min     Max     Avg     Total
Device "GeForce MX150 (0)"
    Kernel: kernel(int const *, int const *, int*, int)
          1  l2_subp0_total_read_sector_queries  131108  131108  131108  131108
          1  l2_subp1_total_read_sector_queries  131093  131093  131093  131093

==25701== Metric result:
Invocations  Metric Name           Metric Description    Min     Max     Avg
Device "GeForce MX150 (0)"
    Kernel: kernel(int const *, int const *, int*, int)
          1  l2_read_transactions  L2 Read Transactions  262201  262201  262201
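
To put the gap between the CUPTI and nvprof numbers in proportion, here is a minimal sketch that computes the relative deviation of a measured counter from a reference value. Treating the nvprof total as the reference is an assumption made for illustration, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Relative deviation of a measured counter from a reference value, in percent.
// The choice of reference (here: the nvprof result) is an assumption.
double relativeDeviationPercent(std::uint64_t measured, std::uint64_t reference) {
    assert(reference != 0);
    const double diff =
        static_cast<double>(measured) - static_cast<double>(reference);
    return 100.0 * std::fabs(diff) / static_cast<double>(reference);
}
```

Against the nvprof value of 262201, the first CUPTI run (262184) deviates by less than 0.01%, while the second (393301) is about 50% higher.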

What’s wrong with CUPTI?

SagarAgrawal · March 6, 2019, 6:43am

We are trying to reproduce the issue here. Once we have reproduced it, we will update this thread.

FYI, nvprof also uses CUPTI internally to profile metrics.

liuty10 · October 18, 2019, 8:42pm

Thanks

vsm2 · December 8, 2019, 12:39am

Hi there,

Any update on this thread?

I am observing very similar behavior. In fact, neither the CUPTI nor the nvprof results make sense.
I am running the following simple application to check whether the counters behave as I expect:

One thread block with one thread. The thread writes a constant value to a volatile memory location; volatile is used to ensure that the write goes to global memory. However, when I profile the counters l2_subp0_write_sector_misses and l2_subp1_write_sector_misses using CUPTI/nvprof, sometimes I get 0 and sometimes a nonzero number of sectors.

Other device-specific configurations:

Using a Titan V GPU with persistence mode disabled and clocks set to max speed.
cudaDeviceReset() and cudaDeviceSynchronize() are called before the start of profiling and execution of the kernel.

With the above procedure, repeated runs give different counter values, and they vary a lot. Any help is much appreciated.
