Hi, I'm trying to use unified memory on the TX2 to save cudaMemcpy time. For basic variable types, the CPU access performance of unified memory after kernel execution is fine.
But when I use unified memory for arrays of structures, the access performance of the same postprocessing loop is much worse than with explicit memory after a D2H copy.
The test demo can be seen at GitHub - Irwin-Liu/UnifiedMemoryDemo: CUDA unified memory test on Jetson.
The CPU side of the postprocessing looks like this:
// img_RGB: array of pixel structs (only the .a field is tested here);
// result_RGB: container (e.g. std::vector) holding the selected pixels.
for (int h = 0; h < row; h++) {
    for (int w = 0; w < col; w++) {
        int index = h * col + w;
        if (img_RGB[index].a > 0) {
            result_RGB.push_back(img_RGB[index]);
        }
    }
}
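Schematically, the two variants being compared produce and read img_RGB roughly like this (a simplified sketch, not the exact demo code; RGBA, fill_kernel, and the launch configuration are placeholder assumptions):

#include <cuda_runtime.h>
#include <cstdlib>

// Placeholder pixel struct and kernel, standing in for the ones in the repo.
struct RGBA { unsigned char r, g, b, a; };

__global__ void fill_kernel(RGBA* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { img[i].a = (i % 2) ? 255 : 0; }
}

// Explicit memory: device buffer, then D2H copy into pageable host memory.
void explicit_path(int row, int col) {
    int n = row * col;
    RGBA* d_img;
    cudaMalloc((void**)&d_img, n * sizeof(RGBA));
    RGBA* h_img = (RGBA*)malloc(n * sizeof(RGBA));   // ordinary cached host memory
    fill_kernel<<<(n + 255) / 256, 256>>>(d_img, n);
    cudaMemcpy(h_img, d_img, n * sizeof(RGBA), cudaMemcpyDeviceToHost);
    // ... postprocess loop reads h_img ...
    free(h_img);
    cudaFree(d_img);
}

// Unified memory: one managed buffer, no explicit copy.
void unified_path(int row, int col) {
    int n = row * col;
    RGBA* um_img;
    cudaMallocManaged((void**)&um_img, n * sizeof(RGBA));
    fill_kernel<<<(n + 255) / 256, 256>>>(um_img, n);
    cudaDeviceSynchronize();   // the CPU must not touch um_img before this
    // ... the same postprocess loop reads um_img directly ...
    cudaFree(um_img);
}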
Running this demo on the TX2, I get the results below:
$ ./ExplicitMemory
kernel and data transfer time: 14.658 ms
postprocess time: 3.183 ms
and
$ ./UnifiedMemory
RGB kernel time: 3.484 ms float kernel time: 0.878 ms
RGB postprocess time: 9.669 ms float postprocess time: 7.063 ms float postprocess time without data copy: 3.301 ms
Whether the TX2 environment is CUDA 9.0 or CUDA 10.0, I get similar results.
It is strange that the same code gets such different performance. Although I found a way to reach the performance of explicit memory while still using unified memory, I would like to know whether there is anything wrong with the original unified memory access.
Hi, recently I found an interesting result with the above demo.
The results above were obtained in Max-N mode, where the 2 Denver2 and 4 ARM A57 cores are all enabled.
But when the mode was set to Max-P ARM, where only the 4 ARM A57 cores are used, the results changed to:
$ ./ExplicitMemory
kernel and data transfer time: 7.877 ms
postprocess time: 1.892 ms
and
RGB kernel time: 0.666 ms float kernel time: 0.582 ms
RGB postprocess time: 4.885 ms float postprocess time: 3.555 ms float postprocess time without data copy: 1.467 ms
All postprocessing is faster in Max-P ARM mode. So is the reason that memory access on the TX2 is slower from a Denver2 core than from an ARM A57 core?
Another question: in either mode, memory access to arrays of structures is slower. I understand that in some situations the reason is a lower cache hit ratio, but that still does not explain the different performance of the same postprocessing code with explicit memory versus unified memory.
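To illustrate the cache-hit point: with the array of structures, the loop only tests the .a field but still pulls the whole struct of every pixel through the cache, whereas a separate per-channel array keeps the scan dense. A rough sketch (img_a and result_indices are placeholder names, not from the demo):

// Array of structures: only .a is tested, but each cache line also
// carries the r/g/b bytes of the pixels in it.
for (int i = 0; i < row * col; i++) {
    if (img_RGB[i].a > 0) {
        result_RGB.push_back(img_RGB[i]);
    }
}

// Separate channel array: the scan touches a dense float array,
// so every byte fetched into the cache is used by the test.
for (int i = 0; i < row * col; i++) {
    if (img_a[i] > 0.0f) {
        result_indices.push_back(i);
    }
}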
Hi,
Sorry for the late update.
Here are some initial thoughts on this issue.
In general, memory access speed is: cudaMalloc > unified memory > pinned memory.
The overhead is: cudaMalloc (full memory copy) > unified memory (buffer synchronization) > pinned memory.
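For reference, the three allocation paths look like this (sketch; the buffer size is a placeholder):

#include <cuda_runtime.h>

void allocation_examples() {
    size_t bytes = 1920 * 1080 * sizeof(float);   // placeholder size

    // 1. cudaMalloc: device memory, fastest for GPU access, but CPU <-> GPU
    //    traffic needs explicit cudaMemcpy calls (the copy overhead above).
    float* d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // 2. Unified memory: one pointer usable by CPU and GPU; the driver
    //    synchronizes the buffer around kernel launches (the sync overhead above).
    float* um_buf;
    cudaMallocManaged((void**)&um_buf, bytes);

    // 3. Pinned host memory: usable by both sides with no copy or migration,
    //    which is why its overhead is lowest and its access speed is listed last.
    float* h_buf;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocMapped);
}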
We are going to reproduce this issue on our side.
May I know which JetPack version you use, v43 or v44?
Thanks.
Hi,
Thanks for the reply!
I tried this demo on 2 TX2s: one is definitely CUDA 10.0 on v43, and the other is CUDA 9.0, probably on v42 (that system was installed more than half a year ago). The results showed no difference.
What I wonder is: although the CPU and GPU physically share the same memory on the TX2, can I say that the explicit memory used by the CPU is faster than pinned memory, cudaMalloc memory, and unified memory, which can be used by both CPU and GPU?
If so, the questions for this demo would be: 1. Why is the access performance for arrays of structures in unified memory much worse than for floats? 2. Is it true that the memory access performance of a Denver2 core is slower than that of an ARM A57 core?
For the 2nd question, I also see similar performance differences in many other codes that run on the TX2 in different modes.
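One way to check the 2nd question is to pin the postprocessing loop to a single core and compare the timing on an A57 core against a Denver2 core. A minimal sketch (the Denver2/A57 core numbering is an assumption and should be verified with lscpu or /proc/cpuinfo on the board):

#include <sched.h>
#include <cstdio>

// Pin the calling thread to one CPU core before running the postprocess loop.
// On the TX2 the Denver2 cores are commonly reported as CPUs 1-2 and the
// A57 cores as 0 and 3-5, but the mapping should be verified per board.
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
    }
}

// Usage:
//   pin_to_core(0);   // A57 (assumed mapping)
//   ... time the postprocess loop ...
//   pin_to_core(1);   // Denver2 (assumed mapping, only online in Max-N mode)
//   ... time the postprocess loop again ...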
Hi,
For more memory information, please check our document here:
Memory from cudaMalloc can only be accessed by the GPU. That's why its overhead is the memory copy time.
Unified memory can be accessed by either the CPU or the GPU, but concurrent access is not supported.
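You can confirm this on your device with the concurrentManagedAccess attribute; a small sketch (device 0 is assumed, and on the TX2 the attribute is expected to be 0):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);
    // When this is 0, the CPU must wait (e.g. cudaDeviceSynchronize())
    // before touching a managed buffer that a kernel has accessed.
    return 0;
}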
1. cudaMalloc memory is faster than unified memory.
Since it is only accessible by the GPU, the memory can be allocated somewhere close to the GPU.
However, the sample doesn't copy the memory from the CPU to the GPU and copy the result back, so that overhead isn't taken into account (see the timing sketch after point 2).
2. We expect the memory access time to be similar.
You can give it a try and measure it directly.
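To include the transfer overhead from point 1, the explicit path can be timed around both copies, for example like this (sketch; buffer names, kernel, and launch configuration are placeholders):

// Time H2D copy + kernel + D2H copy together with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_img, h_in, bytes, cudaMemcpyHostToDevice);
process_kernel<<<blocks, threads>>>(d_img, row, col);
cudaMemcpy(h_out, d_img, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// ms then covers the kernel plus both transfers, comparable to the
// "kernel and data transfer time" printed by ./ExplicitMemory.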
Thanks.