I used the following command to measure some L2 metrics on a Titan V for the matrixMul sample:
CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --quiet \
--metrics l2_write_throughput,l2_read_throughput,l2_utilization \
-f -o titanv.l2.nvvp ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
The result, as can be seen in the picture, is quite strange:
reads = 4,092,337.528 GB/s
writes = 4,092,337.528 GB/s
utilization = Max
https://pasteboard.co/IG3qUxY.png
What results do you get if you just print out the data from nvprof, instead of importing it into nvvp?
CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --metrics l2_write_throughput,l2_read_throughput,l2_utilization ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
(I'm not interested in all the lines that start with Replaying, just the results after that.) I get this on a Tesla V100:
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==16689== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
==16689== Profiling result:
==16689== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla V100-PCIE-32GB (0)"
Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
301 l2_write_throughput L2 Throughput (Writes) 3.0157GB/s 3.5684GB/s 3.0884GB/s
301 l2_read_throughput L2 Throughput (Reads) 382.62GB/s 395.18GB/s 392.67GB/s
301 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
$
Please see below. The results are still problematic:
==7238== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "TITAN V (0)"
Kernel: void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
301 l2_write_throughput L2 Throughput (Writes) 4e+06GB/s 4e+06GB/s 4e+06GB/s
301 l2_read_throughput L2 Throughput (Reads) 4e+06GB/s 4e+06GB/s 4e+06GB/s
301 l2_utilization L2 Cache Utilization Max (10) Max (10) Max (10)
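A quick way to flag readings like these automatically is to parse the throughput rows and compare them against a plausibility ceiling. The following is only a minimal sketch: the helper names `parse_throughput_row` and `is_plausible` are hypothetical, the regular expression assumes the row layout shown in the nvprof text output above, and the 10,000 GB/s ceiling is an arbitrary assumption, not a hardware specification.

```python
import re

# Matches throughput rows as they appear in the nvprof text output above, e.g.
# "301 l2_read_throughput L2 Throughput (Reads) 382.62GB/s 395.18GB/s 392.67GB/s"
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(.*?)\s+"
    r"([\d.eE+]+)GB/s\s+([\d.eE+]+)GB/s\s+([\d.eE+]+)GB/s\s*$"
)


def parse_throughput_row(line):
    """Return a dict for a GB/s metric row, or None if the line does not match."""
    m = ROW.match(line)
    if m is None:
        return None
    return {
        "invocations": int(m.group(1)),
        "metric": m.group(2),
        "description": m.group(3),
        "min": float(m.group(4)),
        "max": float(m.group(5)),
        "avg": float(m.group(6)),
    }


def is_plausible(row, ceiling_gbs):
    """True if the row's maximum stays under the assumed ceiling (in GB/s)."""
    return row["max"] <= ceiling_gbs


ASSUMED_CEILING_GBS = 10_000.0  # assumption for illustration only

for text in (
    "301 l2_read_throughput L2 Throughput (Reads) 382.62GB/s 395.18GB/s 392.67GB/s",
    "301 l2_write_throughput L2 Throughput (Writes) 4e+06GB/s 4e+06GB/s 4e+06GB/s",
):
    row = parse_throughput_row(text)
    status = "ok" if is_plausible(row, ASSUMED_CEILING_GBS) else "suspicious"
    print(row["metric"], row["max"], status)
```

Rows that do not report GB/s, such as the l2_utilization row, are skipped by returning None rather than being guessed at.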
Can you try switching to the latest CUDA, 10.1.243?
I will. In the meantime, let me ask my real question, based on the V100 results.
The L2 utilization is reported as Low (1). I would like to know where the boundary between Low (1) and Low (2) lies. For example, if I assume
max throughput = 395 GB/s * 10 = 3950 GB/s,
then I can't be sure whether 600 GB/s would still be classified as Low (1) or not.
Moreover, is utilization based on max(read, write), on read + write, or on something else?
I haven't seen any documentation describing this. Any comments?
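To make the ambiguity concrete, here is a minimal sketch of the guess in this post. It assumes the 0 to 10 scale is split into equal-width buckets of an unknown peak throughput; this is purely an assumption for illustration, not nvprof's documented behavior. The function `utilization_level` and both peak values are hypothetical.

```python
import math


def utilization_level(throughput_gbs, peak_gbs, levels=10):
    """Map a throughput to a level in 0..levels, ASSUMING equal-width buckets.

    This linear mapping is a guess; how nvprof actually derives the level
    is exactly what the question above is asking.
    """
    if throughput_gbs <= 0:
        return 0
    level = math.ceil(throughput_gbs * levels / peak_gbs)
    return min(max(level, 1), levels)


# Avg values from the V100 output above.
read_gbs, write_gbs = 392.67, 3.0884

# Two hypothetical peaks: the 3950 GB/s guess from this post, and a larger one.
for peak_gbs in (3950.0, 7000.0):
    print(
        "peak", peak_gbs,
        "max(read, write):", utilization_level(max(read_gbs, write_gbs), peak_gbs),
        "read + write:", utilization_level(read_gbs + write_gbs, peak_gbs),
        "600 GB/s:", utilization_level(600.0, peak_gbs),
    )
```

Under these assumptions the classification of 600 GB/s depends entirely on the assumed peak, and max(read, write) versus read + write can land in different buckets, so a single Low (1) reading does not pin down either the boundary or the formula.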