Our test machine has an A100 80GB GPU. The results below were collected with CUDA 11.7. How should I interpret them?
Allocation size in bytes 65536
UNIFIED_MEMORY_COUNTER [ 1696905450031083169 1696905450031085793 ] kind=BYTES_TRANSFER_HTOD value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450031085793 1696905450031093473 ] kind=BYTES_TRANSFER_HTOD value=61440 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032934643 1696905450032936403 ] kind=BYTES_TRANSFER_DTOH value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032936403 1696905450032942451 ] kind=BYTES_TRANSFER_DTOH value=61440 src 0 dst 0
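One way to start reading the counter records above is to turn each one into a duration and an effective transfer rate. The following is only a minimal sketch. It assumes the two bracketed numbers are start and end timestamps in nanoseconds and that `value` is the number of bytes moved in that interval. The names `parse_transfers` and `summarize` are hypothetical helpers, not part of any CUDA tool. Under those assumptions, the second record moves 61440 bytes in 7680 ns, which is 8 bytes per nanosecond.

```python
import re

# Counter records copied verbatim from the output above.
LOG = """\
UNIFIED_MEMORY_COUNTER [ 1696905450031083169 1696905450031085793 ] kind=BYTES_TRANSFER_HTOD value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450031085793 1696905450031093473 ] kind=BYTES_TRANSFER_HTOD value=61440 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032934643 1696905450032936403 ] kind=BYTES_TRANSFER_DTOH value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032936403 1696905450032942451 ] kind=BYTES_TRANSFER_DTOH value=61440 src 0 dst 0
"""

PATTERN = re.compile(
    r"UNIFIED_MEMORY_COUNTER \[ (\d+) (\d+) \] kind=(\w+) value=(\d+)"
)


def parse_transfers(text):
    """Return a list of (kind, bytes, duration_ns), one per counter record."""
    rows = []
    for match in PATTERN.finditer(text):
        start, end, kind, value = match.groups()
        # Assumption: timestamps are in nanoseconds, so end - start is in ns.
        rows.append((kind, int(value), int(end) - int(start)))
    return rows


def summarize(rows):
    """Sum bytes and durations per transfer kind: {kind: (bytes, ns)}."""
    totals = {}
    for kind, nbytes, duration_ns in rows:
        total_bytes, total_ns = totals.get(kind, (0, 0))
        totals[kind] = (total_bytes + nbytes, total_ns + duration_ns)
    return totals


if __name__ == "__main__":
    rows = parse_transfers(LOG)
    for kind, nbytes, duration_ns in rows:
        # bytes per nanosecond is numerically equal to GB/s (10^9 bytes/s).
        print(f"{kind:22s} {nbytes:6d} B {duration_ns:6d} ns "
              f"{nbytes / duration_ns:6.2f} GB/s")
    for kind, (nbytes, duration_ns) in summarize(rows).items():
        print(f"total {kind:16s} {nbytes:6d} B {duration_ns:6d} ns")
```

In each direction the two records add up to 65536 bytes, which matches the reported allocation size, so the 64 KB buffer appears to have been migrated once to the device and once back in two chunks each.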
Here are the results of my tests using the utilities shipped with CUDA 11.1.
GPU Device 0: "Ampere" with compute capability 8.0
Running ..................................
Overall Time For matrixMultiplyPerf
Printing Average of 20 measurements in (ms)
Size_KB   UMhint  UMhntAs   UMeasy    0Copy  MemCopy  CpAsync  CpHpglk  CpPglAs
      4    1.033    1.335    0.522    0.061    0.186    0.129    0.163    0.091
     16    1.001    1.602    0.993    0.138    0.277    0.228    0.275    0.226
     64    1.612    2.230    1.756    0.329    0.525    0.473    0.423    0.321
    256    3.004    3.265    4.934    0.967    1.448    1.442    0.977    0.958
   1024    9.676    8.841   12.018    5.704    5.191    4.765    3.739    3.279
   4096   32.527   24.978   44.869   28.896   18.978   18.002   14.408   12.870
  16384  120.914   90.225  189.213  195.483   82.369   80.798   54.931   53.995
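To compare the columns of the table above across sizes, it may help to normalize each row against one column. The sketch below is only an illustration. It assumes every value is an average time in milliseconds, as the header line states, and it picks the `MemCopy` column as the baseline purely as a choice of reference point. `relative_to` is a hypothetical helper name.

```python
# Column names and rows copied from the table above.
HEADER = "Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs".split()

TABLE = """\
4 1.033 1.335 0.522 0.061 0.186 0.129 0.163 0.091
16 1.001 1.602 0.993 0.138 0.277 0.228 0.275 0.226
64 1.612 2.230 1.756 0.329 0.525 0.473 0.423 0.321
256 3.004 3.265 4.934 0.967 1.448 1.442 0.977 0.958
1024 9.676 8.841 12.018 5.704 5.191 4.765 3.739 3.279
4096 32.527 24.978 44.869 28.896 18.978 18.002 14.408 12.870
16384 120.914 90.225 189.213 195.483 82.369 80.798 54.931 53.995
"""


def relative_to(table_text, baseline="MemCopy"):
    """Return {size_kb: {column: time / baseline_time}} for every row."""
    result = {}
    for line in table_text.strip().splitlines():
        fields = line.split()
        size_kb = int(fields[0])
        times = dict(zip(HEADER[1:], map(float, fields[1:])))
        base = times[baseline]
        # A ratio below 1.0 means the column was faster than the baseline.
        result[size_kb] = {name: t / base for name, t in times.items()}
    return result


if __name__ == "__main__":
    ratios = relative_to(TABLE)
    print("Size_KB " + " ".join(f"{name:>8s}" for name in HEADER[1:]))
    for size_kb, row in ratios.items():
        print(f"{size_kb:7d} " + " ".join(f"{row[name]:8.2f}" for name in HEADER[1:]))
```

Read this way, the `0Copy` column is faster than `MemCopy` at 4 KB but more than twice as slow at 16384 KB, and `CpPglAs` has the lowest time in the 16384 KB row.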