Performance of Unified Memory in CUDA 11.4 vs. CUDA 10.2

While investigating a performance degradation in an application ported from CUDA 10.2 to CUDA 11.4 on a Jetson AGX Xavier, I realized that the problem might come from a difference in memory management between the two versions. To check this, I ran the UnifiedMemoryPerf sample on two Jetson AGX Xavier boards, one with L4T R32.5.2 (CUDA 10.2) and the other with L4T R35.3.1 (CUDA 11.4), both in MAXN power mode with jetson_clocks enabled. These are the results:

CUDA 10.2

Overall Time For matrixMultiplyPerf 

Printing Average of 20 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.120   0.443   0.092   0.039   0.101   0.088   0.119   0.062
16        0.110   0.407   0.121   0.044   0.106   0.103   0.117   0.088
64        0.212   0.540   0.322   0.109   0.191   0.122   0.194   0.100
256       0.431   0.859   0.962   0.306   0.494   0.409   0.446   0.318
1024      1.551   2.032   4.104   1.335   1.935   1.742   1.639   1.412
4096      8.389   8.807  17.868   7.943   9.693   9.866   8.818   8.386
16384    55.121  55.245  91.032  54.522  61.219  62.065  57.337  56.602

CUDA 11.4

Overall Time For matrixMultiplyPerf 

Printing Average of 20 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.169   0.479   0.111   0.039   0.109   0.084   0.137   0.063
16        0.177   0.514   0.152   0.043   0.130   0.134   0.133   0.100
64        0.359   0.705   0.338   0.098   0.161   0.148   0.151   0.117
256       1.088   1.501   1.080   0.282   0.455   0.409   0.443   0.316
1024      4.300   4.710   4.256   1.342   1.927   1.788   1.600   1.431
4096     18.966  19.380  18.949   8.015   9.698   9.552   8.642   8.265
16384    97.018  97.663  96.769  54.051  59.760  60.218  56.289  56.217

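The first three columns (the Unified Memory variants UMhint, UMhntAs and UMeasy) seem to suggest that Unified Memory can be about twice as slow on CUDA 11.4 as on CUDA 10.2. The regression is concentrated in the hinted variants: from 1024 KB upward, UMhint and UMhntAs are roughly 1.8x to 2.8x slower and end up roughly matching UMeasy, while UMeasy itself is only about 4 to 6% slower and the remaining columns (0Copy through CpPglAs) are essentially unchanged. Are these results expected?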
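The size of the slowdown can be quantified per column by dividing each CUDA 11.4 time by the matching CUDA 10.2 time. Below is a minimal sketch that does this for the two tables above, assuming the whitespace-separated layout the sample prints; `parse_table` and `slowdown` are hypothetical helper names, not part of the CUDA samples.

```python
"""Compare two UnifiedMemoryPerf result tables column by column.

Hypothetical helper: the tables are copied from the CUDA 10.2 and
CUDA 11.4 runs above; parsing assumes the sample's whitespace layout.
"""

CUDA_10_2 = """\
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.120   0.443   0.092   0.039   0.101   0.088   0.119   0.062
16        0.110   0.407   0.121   0.044   0.106   0.103   0.117   0.088
64        0.212   0.540   0.322   0.109   0.191   0.122   0.194   0.100
256       0.431   0.859   0.962   0.306   0.494   0.409   0.446   0.318
1024      1.551   2.032   4.104   1.335   1.935   1.742   1.639   1.412
4096      8.389   8.807  17.868   7.943   9.693   9.866   8.818   8.386
16384    55.121  55.245  91.032  54.522  61.219  62.065  57.337  56.602
"""

CUDA_11_4 = """\
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.169   0.479   0.111   0.039   0.109   0.084   0.137   0.063
16        0.177   0.514   0.152   0.043   0.130   0.134   0.133   0.100
64        0.359   0.705   0.338   0.098   0.161   0.148   0.151   0.117
256       1.088   1.501   1.080   0.282   0.455   0.409   0.443   0.316
1024      4.300   4.710   4.256   1.342   1.927   1.788   1.600   1.431
4096     18.966  19.380  18.949   8.015   9.698   9.552   8.642   8.265
16384    97.018  97.663  96.769  54.051  59.760  60.218  56.289  56.217
"""


def parse_table(text):
    """Return (column_names, {size_kb: [time_ms, ...]})."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    columns = header[1:]  # drop the Size_KB label
    data = {int(row[0]): [float(v) for v in row[1:]] for row in rows}
    return columns, data


def slowdown(old_text, new_text):
    """Ratio new/old per size and column; > 1.0 means the new run is slower."""
    cols_old, old = parse_table(old_text)
    cols_new, new = parse_table(new_text)
    if cols_old != cols_new:
        raise ValueError("tables must have the same columns")
    ratios = {
        size: [n / o for n, o in zip(new[size], old[size])]
        for size in old
    }
    return cols_old, ratios


if __name__ == "__main__":
    columns, ratios = slowdown(CUDA_10_2, CUDA_11_4)
    print("Size_KB " + " ".join(f"{c:>7}" for c in columns))
    for size, row in ratios.items():
        print(f"{size:<7} " + " ".join(f"{r:7.2f}" for r in row))
```

Running it prints one row of ratios per size. For example, UMhint comes out at 2.77, 2.26 and 1.76 for 1024, 4096 and 16384 KB, while UMeasy stays between 1.04 and 1.06 over the same sizes.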

Hi,

Thanks for reporting this.
We need to reproduce this in our environment first and check with our internal team.

We will share more information with you later.
Thanks.