I see degraded performance on the A57 and Denver CPUs when I run CPU code on memory allocated with cudaHostAlloc/cudaMallocHost (but not cudaMallocManaged). Specifically, memory allocated with cudaHostAlloc without the cudaHostAllocWriteCombined flag behaves exactly the same on TX2 as memory allocated with that flag. That is, it bypasses the CPU cache, which degrades CPU performance on many workloads. When profiling my application, I see that it does indeed spend a lot more time on memory access instructions. I do not know how to verify from user space whether a page has been marked as WriteCombined.
I do not see this behavior on Intel CPUs or AGX Xavier. I wrote a small test program that allocates memory with various methods, runs the same function on it (single-threaded), and measures the time it takes.
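For readers who want to reproduce this without the attachment, here is a minimal sketch of the kind of test described above. It is not the attached testCudaHostMem.cpp: the function names (`read_pass`, `write_pass`, `read_write_pass`, `measure`, `run_benchmark`), the `USE_CUDA` macro, and the choice of microseconds as the unit are all assumptions made for illustration. Without `USE_CUDA` it only times system malloc memory; with `-DUSE_CUDA` and the CUDA runtime linked it also times the three CUDA allocation methods.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#ifdef USE_CUDA  // hypothetical macro: define it and link against cudart
#include <cuda_runtime.h>
#endif

// volatile accesses keep the compiler from eliding or merging loads/stores.
static uint64_t read_pass(const volatile uint32_t* p, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) sum += p[i];
    return sum;
}

static void write_pass(volatile uint32_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i) p[i] = static_cast<uint32_t>(i);
}

static void read_write_pass(volatile uint32_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i) p[i] = p[i] + 1;
}

// Wall-clock time of one call to f, in microseconds.
template <typename F>
static double time_us(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

struct Timings { double read_us, write_us, read_write_us; };

// Runs the same single-threaded passes over any buffer, however allocated.
static Timings measure(void* mem, size_t bytes) {
    assert(bytes % sizeof(uint32_t) == 0);
    auto* p = static_cast<volatile uint32_t*>(mem);
    const size_t n = bytes / sizeof(uint32_t);
    write_pass(p, n);  // touch every page first so page faults are not timed
    Timings t;
    t.read_us       = time_us([&] { (void)read_pass(p, n); });
    t.write_us      = time_us([&] { write_pass(p, n); });
    t.read_write_us = time_us([&] { read_write_pass(p, n); });
    return t;
}

static void report(const char* label, void* mem, size_t bytes) {
    const Timings t = measure(mem, bytes);
    std::printf("%s memory performance: %.2f, %.2f, %.2f\n",
                label, t.read_us, t.write_us, t.read_write_us);
}

static int run_benchmark(size_t bytes) {
    std::printf("memory performance (us): read, write, read-write\n");
    void* p = std::malloc(bytes);
    if (!p) return 1;
    report("system malloc", p, bytes);
    std::free(p);
#ifdef USE_CUDA
    if (cudaMallocManaged(&p, bytes) == cudaSuccess) {
        report("cudaMallocManaged", p, bytes);
        cudaFree(p);
    }
    if (cudaMallocHost(&p, bytes) == cudaSuccess) {
        report("cudaMallocHost", p, bytes);
        cudaFreeHost(p);
    }
    if (cudaHostAlloc(&p, bytes, cudaHostAllocWriteCombined) == cudaSuccess) {
        report("cudaWriteCombined", p, bytes);
        cudaFreeHost(p);
    }
#endif
    return 0;
}
```

Calling `run_benchmark(64u << 20)` from `main` times a 64 MiB buffer per allocation method. The buffer size is arbitrary here; something well beyond the last-level cache size makes an uncached mapping easier to spot.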
The baseline results on an amd64 host:
Linux/x86_64, kernel 4.4.0-137-generic [#163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018]
CUDA Driver / Runtime 9.0 / 9.0
memory performance: read, write, read-write
system malloc memory performance: 922.00, 1992.00, 1743.00
cudaMallocManaged memory performance: 1464.00, 1706.00, 1978.00
cudaMallocHost memory performance: 1304.00, 1669.00, 1719.00
cudaWriteCombined memory performance: 185568.00, 1127.00, 309549.00
The values reported as "performance" are the time taken for each pass, so smaller numbers are better.
The results are as expected: there is no significant performance difference between system-allocated memory and cudaMallocHost-allocated memory, but reads and read-writes are far slower on memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined).
I ran the same program with fixed CPU clocks on the A57-cluster of a TX2:
Linux/aarch64, kernel 4.4.38-tegra [#1 SMP PREEMPT Thu May 17 00:15:19 PDT 2018]
CUDA version 9.0 (driver 9.0)
memory performance: read, write, read-write
system malloc memory performance: 2517.00, 2530.00, 3904.00
cudaMallocManaged memory performance: 3231.00, 2535.00, 3875.00
cudaMallocHost memory performance: 30167.00, 2518.00, 710805.00
cudaWriteCombined memory performance: 30045.00, 2517.00, 710500.00
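These results are unexpected: cudaMallocHost memory behaves exactly like memory allocated with cudaHostAlloc(…, cudaHostAllocWriteCombined). Compared with system malloc memory, reads take roughly 12 times as long and the read-write pass roughly 180 times as long, while writes are unaffected. L4T 28.1 (CUDA 8.0) behaves no differently from L4T 28.2.1 (CUDA 9.0).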
testCudaHostMem.cpp (3.37 KB)