Unexpected caching behavior of cudaMallocHost-allocated memory on TX2

I see degraded performance on the A57 and Denver CPUs when I run CPU code on memory allocated with cudaHostAlloc/cudaMallocHost (but not cudaMallocManaged). Specifically, memory allocated with cudaHostAlloc without the cudaHostAllocWriteCombined flag behaves exactly the same on the TX2 as memory allocated with the cudaHostAllocWriteCombined flag. That is, it bypasses the CPU cache, which obviously degrades CPU performance on many workloads. When profiling my application, I see that it indeed spends a lot more time on memory access instructions. I do not know how to verify from user space whether a page has been marked as write-combined.

I do not see this behavior on Intel CPUs or AGX Xavier. I wrote a small test program that allocates memory with various methods, runs the same function on it (single-threaded), and measures the time it takes.
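
For illustration, here is a minimal sketch of the kind of comparison the test program makes; it is not the attached testCudaHostMem.cpp, and the buffer size, output format, and loop are placeholders:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times a single-threaded CPU read pass over the buffer, in microseconds.
static double time_read(const volatile unsigned char* buf, size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i) sum += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    if (sum == 0xdeadbeefUL) printf(" ");   // keep the loop from being optimized away
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main() {
    const size_t N = 64 << 20;   // 64 MiB, illustrative
    unsigned char *sys, *managed, *pinned, *wc;

    sys = (unsigned char*)malloc(N);
    cudaMallocManaged((void**)&managed, N);
    cudaMallocHost((void**)&pinned, N);                        // default pinned allocation
    cudaHostAlloc((void**)&wc, N, cudaHostAllocWriteCombined); // write-combined pinned allocation

    memset(sys, 1, N); memset(managed, 1, N); memset(pinned, 1, N); memset(wc, 1, N);

    printf("    system malloc read: %10.2f us\n", time_read(sys, N));
    printf("cudaMallocManaged read: %10.2f us\n", time_read(managed, N));
    printf("   cudaMallocHost read: %10.2f us\n", time_read(pinned, N));
    printf("cudaWriteCombined read: %10.2f us\n", time_read(wc, N));

    free(sys);
    cudaFree(managed);
    cudaFreeHost(pinned);
    cudaFreeHost(wc);
    return 0;
}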

The baseline results on an amd64 host:

Linux/x86_64, kernel 4.4.0-137-generic [#163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018]
CUDA Driver / Runtime 9.0 / 9.0
                  memory performance:       read,      write, read-write
    system malloc memory performance:     922.00,    1992.00,    1743.00
cudaMallocManaged memory performance:    1464.00,    1706.00,    1978.00
   cudaMallocHost memory performance:    1304.00,    1669.00,    1719.00
cudaWriteCombined memory performance:  185568.00,    1127.00,  309549.00

As the performance metric, the table lists the time taken; smaller numbers are better.

The results are as expected: there is no performance difference between system-allocated memory and cudaMallocHost-allocated memory, but there is a significant difference when using memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined).

I ran the same program with fixed CPU clocks on the A57 cluster of a TX2:

Linux/aarch64, kernel 4.4.38-tegra [#1 SMP PREEMPT Thu May 17 00:15:19 PDT 2018]
CUDA version 9.0 (driver 9.0)
                  memory performance:       read,      write, read-write
    system malloc memory performance:    2517.00,    2530.00,    3904.00
cudaMallocManaged memory performance:    3231.00,    2535.00,    3875.00
   cudaMallocHost memory performance:   30167.00,    2518.00,  710805.00
cudaWriteCombined memory performance:   30045.00,    2517.00,  710500.00

These results are unexpected: cudaMallocHost memory behaves exactly like memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined). L4T 28.1 (CUDA 8.0) behaves no differently from L4T 28.2.1 (CUDA 9.0).

The performance degradation is different for the Denver cores, but you can still see that cudaMallocHost behaves like write-combined memory, not like ordinary system memory:

Linux/aarch64, kernel 4.4.38-tegra [#1 SMP PREEMPT Thu May 17 00:15:19 PDT 2018]
CUDA version 9.0 (driver 9.0)
                  memory performance:       read,      write, read-write
    system malloc memory performance:    2538.00,    2517.00,    3929.00
cudaMallocManaged memory performance:    3268.00,    2544.00,    8933.00
   cudaMallocHost memory performance:    5100.00,    2041.00,  969501.00
cudaWriteCombined memory performance:    3179.00,    1887.00,  970239.00

I can see the same behavior on TK1:

Linux/armv7l, kernel 3.10.40 [#22 SMP PREEMPT Fri Sep 11 18:31:28 CST 2015]
CUDA Driver Version / Runtime Version          6.5 / 6.5
                  memory performance:       read,      write, read-write
    system malloc memory performance:    3378.00,    4132.00,    7833.00
cudaMallocManaged memory performance:    3864.00,    3648.00,    6148.00
   cudaMallocHost memory performance:   58155.00,   14893.00,  552427.00
cudaWriteCombined memory performance:   55659.00,   13386.00,  542529.00

but not on an AGX Xavier:

Linux/aarch64, kernel 4.9.108-tegra [#1 SMP PREEMPT Fri Sep 28 22:03:31 PDT 2018]
  CUDA Driver Version / Runtime Version          10.0 / 10.0
                  memory performance:       read,      write, read-write
    system malloc memory performance:    1732.00,    1704.00,    2351.00
cudaMallocManaged memory performance:    1808.00,    1741.00,    1453.00
   cudaMallocHost memory performance:    1727.00,    1737.00,    1456.00
cudaWriteCombined memory performance:   16934.00,    1762.00,  851765.00

The source code of my test program is attached for reference.
testCudaHostMem.cpp (3.37 KB)

Hi,

Thanks for your detailed experiment. It is recommended to read this document first:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Memory Type        | CPU                                                                | iGPU
-------------------+--------------------------------------------------------------------+---------
Pinned host memory | Uncached where compute capability is less than 7.2.               | Uncached
                   | Cached where compute capability is greater than or equal to 7.2.  |

This should be enough to explain the behavior you saw.
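
If it helps, here is a small sketch (assuming the iGPU is CUDA device 0) that reports which case from the table above a given board falls into, using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // assumes device 0 is the integrated GPU
    int cc = prop.major * 10 + prop.minor;   // e.g. 62 on TX2, 72 on AGX Xavier
    printf("%s: compute capability %d.%d -> pinned host memory is %s on the CPU\n",
           prop.name, prop.major, prop.minor,
           cc >= 72 ? "cached" : "uncached");
    return 0;
}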
Thanks

Hi,

thank you for your reply; that does indeed explain the behavior I see.
It is good to see this documented at all, but it is well hidden: TX2 owners know that CUDA 10.0 is not officially supported on their platform, and earlier CUDA versions do not include a “CUDA for Tegra” application note in their documentation.