Weird L2 values for MatrixMul sample

I used the following command to measure some L2 metrics on a TITAN V for the matrixMul sample.

CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --quiet \
 --metrics l2_write_throughput,l2_read_throughput,l2_utilization \
-f -o titanv.l2.nvvp ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048

The result, as can be seen in the picture below, is quite strange.

reads = 4,092,337.528 GB/s
writes = 4,092,337.528 GB/s
utilization = Max

https://pasteboard.co/IG3qUxY.png

What results do you get if you just print out the data from nvprof, instead of importing it into nvvp?

CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --metrics l2_write_throughput,l2_read_throughput,l2_utilization ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048

(I'm not interested in all the lines that start with Replaying, just the results after that.) I get this on a Tesla V100:

Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==16689== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
==16689== Profiling result:
==16689== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301                       l2_write_throughput                    L2 Throughput (Writes)  3.0157GB/s  3.5684GB/s  3.0884GB/s
        301                        l2_read_throughput                     L2 Throughput (Reads)  382.62GB/s  395.18GB/s  392.67GB/s
        301                            l2_utilization                      L2 Cache Utilization     Low (1)     Low (1)     Low (1)
$

Please see below. It is still problematic:

==7238== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "TITAN V (0)"
    Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301                       l2_write_throughput                    L2 Throughput (Writes)   4e+06GB/s   4e+06GB/s   4e+06GB/s
        301                        l2_read_throughput                     L2 Throughput (Reads)   4e+06GB/s   4e+06GB/s   4e+06GB/s
        301                            l2_utilization                      L2 Cache Utilization    Max (10)    Max (10)    Max (10)

Can you try switching to the latest CUDA 10.1.243?
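For example, assuming the new toolkit ends up installed next to your current one (the path below just follows the pattern of your earlier command), you can confirm which nvprof you are actually running with:

~/cuda-10.1.243/bin/nvprof --version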

I will. In the meantime, let me ask my real question based on the V100 results.

The L2 utilization is reported as Low (1). I would like to know where the boundary between Low (1) and Low (2) lies. For example, if I estimate the maximum throughput as
Max throughput = 395 GB/s * 10 = 3950 GB/s,
then I can't tell whether 600 GB/s would still be classified as Low (1) or not.

Moreover, is utilization based on max(read, write), on read + write, or on something else?
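To make the question concrete, below is a rough sketch of how I am currently imagining the mapping. The peak bandwidth, the linear 10-level binning, and the use of max(read, write) are all my own guesses, not documented nvprof behavior:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Rough sketch of how the 0-10 utilization level *might* be derived.
// Everything here is assumed: the peak L2 bandwidth, the linear binning,
// and the choice of max(read, write) as the basis.
static int guessed_utilization_level(double read_gbps, double write_gbps,
                                     double peak_gbps) {
    double achieved = std::max(read_gbps, write_gbps);    // or read + write?
    int level = static_cast<int>(std::ceil(10.0 * achieved / peak_gbps));
    return std::min(std::max(level, 1), 10);              // clamp to 1..10
}

int main() {
    // Assumed peak: 395 GB/s observed at Low (1), times 10 levels.
    const double peak_gbps = 3950.0;
    std::printf("~395 GB/s -> level %d\n",
                guessed_utilization_level(392.67, 3.09, peak_gbps));  // -> 1
    std::printf(" 600 GB/s -> level %d\n",
                guessed_utilization_level(600.0, 3.09, peak_gbps));   // -> 2 under this guess
    return 0;
}

Under this particular guess 600 GB/s would already fall into Low (2), but with a different binning or peak value it could just as well stay at Low (1), which is exactly what I cannot tell from the profiler output.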

I didn't find any documentation describing this. Any comments?