How should I analyze test results for UNIFIED_MEMORY in CUDA 11.7?

LeoKim · October 10, 2023, 7:32am

Our test machine has an A100 80GB GPU. The test results on CUDA 11.7 are shown below. How should I analyze them?

Allocation size in bytes 65536
UNIFIED_MEMORY_COUNTER [ 1696905450031083169 1696905450031085793 ] kind=BYTES_TRANSFER_HTOD value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450031085793 1696905450031093473 ] kind=BYTES_TRANSFER_HTOD value=61440 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032934643 1696905450032936403 ] kind=BYTES_TRANSFER_DTOH value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032936403 1696905450032942451 ] kind=BYTES_TRANSFER_DTOH value=61440 src 0 dst 0

Here are the results of my tests with the utilities provided by CUDA 11.1.

GPU Device 0: "Ampere" with compute capability 8.0
Running ..................................
Overall Time For matrixMultiplyPerf 
Printing Average of 20 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         1.033   1.335   0.522   0.061   0.186   0.129   0.163   0.091
16        1.001   1.602   0.993   0.138   0.277   0.228   0.275   0.226
64        1.612   2.230   1.756   0.329   0.525   0.473   0.423   0.321
256       3.004   3.265   4.934   0.967   1.448   1.442   0.977   0.958
1024      9.676   8.841  12.018   5.704   5.191   4.765   3.739   3.279
4096     32.527  24.978  44.869  28.896  18.978  18.002  14.408  12.870
16384   120.914  90.225 189.213 195.483  82.369  80.798  54.931  53.995

mjain · October 16, 2023, 7:04am

Hi LeoKim, it would help us understand your question better if you could tell us why you are using these two samples. How are the two samples related? Are you trying to correlate their outputs in some way?

Based on our understanding of the results, the first output appears to have been generated by the CUPTI sample unified_memory. If so, this sample shows how to collect information about page transfers for a unified memory application using the CUPTI APIs. For example, the first row shows that 4096 bytes located at the address 7fe1de000000 were transferred from host to device memory in 2624 ns (the end timestamp minus the start timestamp). The CUPTI activity record structure for unified memory can be used to interpret the result: https://docs.nvidia.com/cupti/annotated.html#structCUpti__ActivityUnifiedMemoryCounter2
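To make the interpretation above concrete, here is a minimal Python sketch that parses the posted UNIFIED_MEMORY_COUNTER lines and derives a duration and an approximate throughput for each transfer. It is not part of the CUPTI sample. The regular expression is inferred from the four lines posted in this thread, the timestamps are assumed to be in nanoseconds, and the names `parse_records` and `summarize` are hypothetical helpers made up for illustration.

```python
import re

# Lines copied verbatim from the output posted earlier in this thread.
LOG = """\
UNIFIED_MEMORY_COUNTER [ 1696905450031083169 1696905450031085793 ] kind=BYTES_TRANSFER_HTOD value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450031085793 1696905450031093473 ] kind=BYTES_TRANSFER_HTOD value=61440 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032934643 1696905450032936403 ] kind=BYTES_TRANSFER_DTOH value=4096 src 0 dst 0
UNIFIED_MEMORY_COUNTER [ 1696905450032936403 1696905450032942451 ] kind=BYTES_TRANSFER_DTOH value=61440 src 0 dst 0
"""

# Assumed line format: "[ start end ] kind=<KIND> value=<BYTES>".
PATTERN = re.compile(
    r"UNIFIED_MEMORY_COUNTER \[ (\d+) (\d+) \] kind=(\w+) value=(\d+)"
)


def parse_records(text):
    """Return a list of (kind, bytes, duration_ns), one per counter line."""
    records = []
    for match in PATTERN.finditer(text):
        start, end, kind, value = match.groups()
        records.append((kind, int(value), int(end) - int(start)))
    return records


def summarize(records):
    """Sum bytes and elapsed nanoseconds per transfer kind."""
    totals = {}
    for kind, nbytes, duration in records:
        total_bytes, total_ns = totals.get(kind, (0, 0))
        totals[kind] = (total_bytes + nbytes, total_ns + duration)
    return totals


if __name__ == "__main__":
    for kind, nbytes, duration in parse_records(LOG):
        # bytes per nanosecond is numerically equal to GB/s (10^9 bytes/s)
        print(f"{kind}: {nbytes} bytes in {duration} ns "
              f"({nbytes / duration:.2f} GB/s)")
    for kind, (nbytes, duration) in summarize(parse_records(LOG)).items():
        print(f"{kind} total: {nbytes} bytes in {duration} ns")
```

In the posted output, the two HTOD records add up to the 65536-byte allocation, and so do the two DTOH records. Summing per kind is therefore a quick check that the whole buffer migrated in each direction.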

The second output appears to be from the CUDA sample UnifiedMemoryPerf. Please refer to this page for details about the sample: https://github.com/NVIDIA/cuda-samples/tree/master/Samples/6_Performance/UnifiedMemoryPerf
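One simple way to start reading the UnifiedMemoryPerf table is to find which column has the lowest average time at each matrix size. The Python sketch below does that with the numbers posted in this thread. It makes no assumption about what each column abbreviation means (the sample's page describes them), and the helper names `parse_table` and `fastest_per_size` are hypothetical.

```python
# Table copied from the UnifiedMemoryPerf output posted earlier in this thread.
TABLE = """\
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         1.033   1.335   0.522   0.061   0.186   0.129   0.163   0.091
16        1.001   1.602   0.993   0.138   0.277   0.228   0.275   0.226
64        1.612   2.230   1.756   0.329   0.525   0.473   0.423   0.321
256       3.004   3.265   4.934   0.967   1.448   1.442   0.977   0.958
1024      9.676   8.841  12.018   5.704   5.191   4.765   3.739   3.279
4096     32.527  24.978  44.869  28.896  18.978  18.002  14.408  12.870
16384   120.914  90.225 189.213 195.483  82.369  80.798  54.931  53.995
"""


def parse_table(text):
    """Return {size_kb: {column_name: average_ms}} from the printed table."""
    lines = text.strip().splitlines()
    columns = lines[0].split()[1:]  # skip the Size_KB header
    rows = {}
    for line in lines[1:]:
        fields = line.split()
        rows[int(fields[0])] = dict(zip(columns, map(float, fields[1:])))
    return rows


def fastest_per_size(rows):
    """Return {size_kb: name of the column with the lowest time}."""
    return {size: min(times, key=times.get) for size, times in rows.items()}


if __name__ == "__main__":
    rows = parse_table(TABLE)
    for size, column in fastest_per_size(rows).items():
        print(f"{size:>6} KB: fastest = {column} ({rows[size][column]} ms)")
```

The same dictionary can be reused for other comparisons, for example dividing each column by the MemCopy column to see relative cost as the size grows.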

mjain · November 8, 2023, 11:58am

Hi LeoKim,

Did my previous post help you analyze the unified memory results? If so, could you please close the topic? If you have any further questions, please let us know.
