Unexpected caching behavior of cudaMallocHost-allocated memory on TX2

I see degraded performance on the A57 and Denver CPUs when I run CPU code on memory allocated with cudaHostAlloc/cudaMallocHost (but not cudaMallocManaged). Specifically, memory allocated with cudaHostAlloc without the cudaHostAllocWriteCombined flag behaves exactly the same on the TX2 as memory allocated with the cudaHostAllocWriteCombined flag. That is, it bypasses the CPU cache, which obviously degrades CPU performance on many workloads. When profiling my application, I see that it indeed spends a lot more time on memory access instructions. I do not know how to verify from user space whether a page has been marked as write-combined.

I do not see this behavior on Intel CPUs or AGX Xavier. I wrote a small test program that allocates memory with various methods, runs the same function on it (single-threaded), and measures the time it takes.
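
For illustration, here is a minimal sketch of the kind of comparison the test program makes; it is not the attached testCudaHostMem.cpp, and the buffer size, output format, and loop are placeholders:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times a single-threaded CPU read pass over the buffer, in microseconds.
static double time_read(const volatile unsigned char* buf, size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i) sum += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    if (sum == 0xdeadbeefUL) printf(" ");   // keep the loop from being optimized away
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main() {
    const size_t N = 64 << 20;   // 64 MiB, illustrative
    unsigned char *sys, *managed, *pinned, *wc;

    sys = (unsigned char*)malloc(N);
    cudaMallocManaged((void**)&managed, N);
    cudaMallocHost((void**)&pinned, N);                        // default pinned allocation
    cudaHostAlloc((void**)&wc, N, cudaHostAllocWriteCombined); // write-combined pinned allocation

    memset(sys, 1, N); memset(managed, 1, N); memset(pinned, 1, N); memset(wc, 1, N);

    printf("    system malloc read: %10.2f us\n", time_read(sys, N));
    printf("cudaMallocManaged read: %10.2f us\n", time_read(managed, N));
    printf("   cudaMallocHost read: %10.2f us\n", time_read(pinned, N));
    printf("cudaWriteCombined read: %10.2f us\n", time_read(wc, N));

    free(sys);
    cudaFree(managed);
    cudaFreeHost(pinned);
    cudaFreeHost(wc);
    return 0;
}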

The baseline results on an amd64 host:

Linux/x86_64, kernel 4.4.0-137-generic [#163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018]
CUDA Driver / Runtime 9.0 / 9.0
                  memory performance:       read,      write, read-write
    system malloc memory performance:     922.00,    1992.00,    1743.00
cudaMallocManaged memory performance:    1464.00,    1706.00,    1978.00
   cudaMallocHost memory performance:    1304.00,    1669.00,    1719.00
cudaWriteCombined memory performance:  185568.00,    1127.00,  309549.00

As the performance metric, the table lists the time taken; smaller numbers are better.

The results are as expected: there is no performance difference between system-allocated memory and cudaMallocHost-allocated memory, but there is a significant difference when using memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined).

I ran the same program with fixed CPU clocks on the A57 cluster of a TX2:

Linux/aarch64, kernel 4.4.38-tegra [#1 SMP PREEMPT Thu May 17 00:15:19 PDT 2018]
CUDA version 9.0 (driver 9.0)
                  memory performance:       read,      write, read-write
    system malloc memory performance:    2517.00,    2530.00,    3904.00
cudaMallocManaged memory performance:    3231.00,    2535.00,    3875.00
   cudaMallocHost memory performance:   30167.00,    2518.00,  710805.00
cudaWriteCombined memory performance:   30045.00,    2517.00,  710500.00

These results are unexpected: cudaMallocHost memory behaves exactly like memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined). L4T 28.1 (CUDA 8.0) behaves no differently from L4T 28.2.1 (CUDA 9.0).

The performance degradation is different for the Denver cores, but you can still see that cudaMallocHost behaves like write-combined memory, not like ordinary system memory:

Linux/aarch64, kernel 4.4.38-tegra [#1 SMP PREEMPT Thu May 17 00:15:19 PDT 2018]
CUDA version 9.0 (driver 9.0)
                  memory performance:       read,      write, read-write
    system malloc memory performance:    2538.00,    2517.00,    3929.00
cudaMallocManaged memory performance:    3268.00,    2544.00,    8933.00
   cudaMallocHost memory performance:    5100.00,    2041.00,  969501.00
cudaWriteCombined memory performance:    3179.00,    1887.00,  970239.00

I can see the same behavior on TK1:

Linux/armv7l, kernel 3.10.40 [#22 SMP PREEMPT Fri Sep 11 18:31:28 CST 2015]
CUDA Driver Version / Runtime Version          6.5 / 6.5
                  memory performance:       read,      write, read-write
    system malloc memory performance:    3378.00,    4132.00,    7833.00
cudaMallocManaged memory performance:    3864.00,    3648.00,    6148.00
   cudaMallocHost memory performance:   58155.00,   14893.00,  552427.00
cudaWriteCombined memory performance:   55659.00,   13386.00,  542529.00

but not on an AGX Xavier:

Linux/aarch64, kernel 4.9.108-tegra [#1 SMP PREEMPT Fri Sep 28 22:03:31 PDT 2018]
  CUDA Driver Version / Runtime Version          10.0 / 10.0
                  memory performance:       read,      write, read-write
    system malloc memory performance:    1732.00,    1704.00,    2351.00
cudaMallocManaged memory performance:    1808.00,    1741.00,    1453.00
   cudaMallocHost memory performance:    1727.00,    1737.00,    1456.00
cudaWriteCombined memory performance:   16934.00,    1762.00,  851765.00

The source code of my test program is attached for reference.
testCudaHostMem.cpp (3.37 KB)

Hi,

Thanks for your detailed experiment. It is recommended to read this document first:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Memory Type        | CPU                                                                | iGPU
-------------------+--------------------------------------------------------------------+---------
Pinned host memory | Uncached where compute capability is less than 7.2.               | Uncached
                   | Cached where compute capability is greater than or equal to 7.2.  |

This should be enough to explain the behavior you saw.
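
If it helps, here is a small sketch (assuming the iGPU is CUDA device 0) that reports which case from the table above a given board falls into, using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // assumes device 0 is the integrated GPU
    int cc = prop.major * 10 + prop.minor;   // e.g. 62 on TX2, 72 on AGX Xavier
    printf("%s: compute capability %d.%d -> pinned host memory is %s on the CPU\n",
           prop.name, prop.major, prop.minor,
           cc >= 72 ? "cached" : "uncached");
    return 0;
}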
Thanks

Hi,

thank you for your reply; that does indeed explain the behavior I see.
It is good to see this documented at all, but it is well hidden: TX2 owners know that CUDA 10.0 is not officially supported on their platform, and earlier CUDA versions do not include a “CUDA for Tegra” application note in their documentation.