Accessing Unified Memory from ARM is very slow

I am experimenting with a simple image processing application using CUDA.

However, access from the ARM CPU to Unified Memory is very slow.
(Moreover, it is even slower when using CudaStream.)

The image processing is as follows.
Target: 100M-pixel images
(Proc-1) Binarization
(Proc-2) Create a coordinate list of the non-zero pixels of the binarized image

Experimental results are as follows.

Experiment No. 1 uses ARM for both Proc-1 and Proc-2.
Experiment No. 2 uses CUDA and CudaStream for Proc-1, and ARM for Proc-2.
(Note: The reason for using CudaStream is that this image processing will be run in parallel by multiple threads.)

Experiment No. 2 turned out to be slower even though it uses CUDA.
So I investigated.
The total time of Experiment No. 2 is 410ms, and it breaks down as follows.
Proc-1: 10ms with CUDA
Proc-2: 400ms with ARM

I wonder why Experiment No. 1 takes 260ms in total when both Proc-1 and Proc-2 run on ARM, while in Experiment No. 2 the ARM part alone (Proc-2) takes 400ms.
I checked for programming mistakes but found none.

So I suspected Unified Memory and CudaStream.
In Experiment No. 3, I moved Proc-1 back from CUDA to ARM while still using Unified Memory and CudaStream.
As expected, it became even slower (1320ms).

In Experiment No. 4, I stopped using CudaStream while still using Unified Memory.
The result was 570ms.

After that, I also stopped using Unified Memory in the Experiment No. 4 setup.
The result was then the same as in Experiment No. 1.

Some of the image processing algorithms require mixing CUDA and ARM, so I have to use Unified Memory. Also, because I want to run the same image processing in parallel, CudaStream is essential as well.

Could you tell me how to solve this problem?
Are there any points to be aware of regarding API usage or compilation options?


Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you try cudaStreamAttachMemAsync to prefetch the data and reduce the latency?
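As a rough illustration of that suggestion, here is a minimal sketch of how cudaStreamAttachMemAsync could be placed around Proc-1 and Proc-2. It is not taken from the original application: the kernel name binarize, the Point struct, the threshold of 127 and the small 1024x1024 test image are all assumptions made for the example. The idea is to attach the managed buffers to the stream before the kernel and attach the result back to the host before the ARM side scans it. Whether this removes the slowdown on your board and JetPack version would need to be measured.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Proc-1 (hypothetical kernel): 255 where the pixel exceeds the threshold, else 0.
__global__ void binarize(const uint8_t* src, uint8_t* dst, size_t n, uint8_t threshold)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = (src[i] > threshold) ? 255 : 0;
}

struct Point { int x, y; };  // hypothetical coordinate type for Proc-2

int main()
{
    const int width = 1024, height = 1024;  // small stand-in for the 100M-pixel image
    const size_t n = (size_t)width * height;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Managed buffers, initially attached to the host so the CPU can fill them.
    uint8_t *src = nullptr, *dst = nullptr;
    CHECK(cudaMallocManaged(&src, n, cudaMemAttachHost));
    CHECK(cudaMallocManaged(&dst, n, cudaMemAttachHost));
    for (size_t i = 0; i < n; ++i) src[i] = (uint8_t)(i % 256);

    // Before the kernel: attach both buffers to this stream only.
    CHECK(cudaStreamAttachMemAsync(stream, src, 0, cudaMemAttachSingle));
    CHECK(cudaStreamAttachMemAsync(stream, dst, 0, cudaMemAttachSingle));

    const int block = 256;
    const unsigned grid = (unsigned)((n + block - 1) / block);
    binarize<<<grid, block, 0, stream>>>(src, dst, n, 127);
    CHECK(cudaGetLastError());

    // Before CPU access: attach the result back to the host, then wait for the stream.
    CHECK(cudaStreamAttachMemAsync(stream, dst, 0, cudaMemAttachHost));
    CHECK(cudaStreamSynchronize(stream));

    // Proc-2 on the ARM CPU: collect coordinates of non-zero pixels.
    std::vector<Point> points;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (dst[(size_t)y * width + x] != 0) points.push_back({x, y});

    std::printf("non-zero pixels: %zu\n", points.size());

    // Half of the synthetic input values (128..255) exceed the threshold.
    const bool ok = (points.size() == n / 2);

    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    CHECK(cudaStreamDestroy(stream));
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

With one stream per worker thread, each thread would attach only its own buffers to its own stream, so a kernel running in one stream should not block CPU access to buffers owned by another. Timing Proc-2 with and without the two attach calls should show whether it helps in your case.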
Here is a document for picking memory type on the Jetson platform for your reference: