While investigating a performance regression in an application ported from CUDA 10.2 to CUDA 11.4 on a Jetson AGX Xavier, I began to suspect that the cause was a difference in memory management between the two versions. So I ran the UnifiedMemoryPerf sample on two Jetson AGX Xavier units, one with L4T R32.5.2 (CUDA 10.2) and the other with R35.3.1 (CUDA 11.4), both in MAXN power mode with Jetson Clocks enabled. These are the results:
CUDA 10.2
Overall Time For matrixMultiplyPerf
Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.120 0.443 0.092 0.039 0.101 0.088 0.119 0.062
16 0.110 0.407 0.121 0.044 0.106 0.103 0.117 0.088
64 0.212 0.540 0.322 0.109 0.191 0.122 0.194 0.100
256 0.431 0.859 0.962 0.306 0.494 0.409 0.446 0.318
1024 1.551 2.032 4.104 1.335 1.935 1.742 1.639 1.412
4096 8.389 8.807 17.868 7.943 9.693 9.866 8.818 8.386
16384 55.121 55.245 91.032 54.522 61.219 62.065 57.337 56.602
CUDA 11.4
Overall Time For matrixMultiplyPerf
Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.169 0.479 0.111 0.039 0.109 0.084 0.137 0.063
16 0.177 0.514 0.152 0.043 0.130 0.134 0.133 0.100
64 0.359 0.705 0.338 0.098 0.161 0.148 0.151 0.117
256 1.088 1.501 1.080 0.282 0.455 0.409 0.443 0.316
1024 4.300 4.710 4.256 1.342 1.927 1.788 1.600 1.431
4096 18.966 19.380 18.949 8.015 9.698 9.552 8.642 8.265
16384 97.018 97.663 96.769 54.051 59.760 60.218 56.289 56.217
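The first two columns (UMhint and UMhntAs) seem to suggest that Unified Memory with hints can be about twice as slow on CUDA 11.4 as on CUDA 10.2 for sizes of 256 KB and above: on CUDA 11.4 they now take about as long as UMeasy, which is itself only slightly slower than before. The zero-copy and explicit-copy columns are essentially unchanged. Are these results expected?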
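To make the comparison concrete, here is a small Python sketch that divides the CUDA 11.4 times by the CUDA 10.2 times for the three Unified Memory columns. The numbers are copied from the two tables above; the names `cuda_10_2`, `cuda_11_4` and `slowdown` are only illustrative and are not part of the UnifiedMemoryPerf sample.

```python
# Times in ms for the Unified Memory columns, copied from the tables above.
# Each entry maps Size_KB -> (UMhint, UMhntAs, UMeasy).
cuda_10_2 = {
    4: (0.120, 0.443, 0.092),
    16: (0.110, 0.407, 0.121),
    64: (0.212, 0.540, 0.322),
    256: (0.431, 0.859, 0.962),
    1024: (1.551, 2.032, 4.104),
    4096: (8.389, 8.807, 17.868),
    16384: (55.121, 55.245, 91.032),
}
cuda_11_4 = {
    4: (0.169, 0.479, 0.111),
    16: (0.177, 0.514, 0.152),
    64: (0.359, 0.705, 0.338),
    256: (1.088, 1.501, 1.080),
    1024: (4.300, 4.710, 4.256),
    4096: (18.966, 19.380, 18.949),
    16384: (97.018, 97.663, 96.769),
}


def slowdown(old, new):
    """Return new/old time ratios per size; a value above 1 means slower."""
    return {
        size: tuple(n / o for o, n in zip(old[size], new[size]))
        for size in old
    }


ratios = slowdown(cuda_10_2, cuda_11_4)

print(f"{'Size_KB':>8} {'UMhint':>8} {'UMhntAs':>8} {'UMeasy':>8}")
for size, (hint, hint_async, easy) in ratios.items():
    print(f"{size:>8} {hint:>8.2f} {hint_async:>8.2f} {easy:>8.2f}")
```

With these numbers, the UMhint ratio at 4096 KB comes out at about 2.26, while the UMeasy ratio at the same size is about 1.06.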