Hi, I'm trying to use unified memory on the TX2 to save cudaMemcpy time. For basic variable types, the CPU access performance of unified memory after kernel execution is fine.
But when I use unified memory for arrays of structures, the access performance of the same postprocessing loop is much worse than with explicit memory after a D2H copy.
The test demo can be seen at GitHub - Irwin-Liu/UnifiedMemoryDemo: CUDA unified memory test on Jetson.
The CPU side of the postprocessing looks like this:
// img_RGB: array of pixel structs (only the .a field is tested here);
// result_RGB: container (e.g. std::vector) holding the selected pixels.
for (int h = 0; h < row; h++) {
    for (int w = 0; w < col; w++) {
        int index = h * col + w;
        if (img_RGB[index].a > 0) {
            result_RGB.push_back(img_RGB[index]);
        }
    }
}
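Schematically, the two variants being compared produce and read img_RGB roughly like this (a simplified sketch, not the exact demo code; RGBA, fill_kernel, and the launch configuration are placeholder assumptions):

#include <cuda_runtime.h>
#include <cstdlib>

// Placeholder pixel struct and kernel, standing in for the ones in the repo.
struct RGBA { unsigned char r, g, b, a; };

__global__ void fill_kernel(RGBA* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { img[i].a = (i % 2) ? 255 : 0; }
}

// Explicit memory: device buffer, then D2H copy into pageable host memory.
void explicit_path(int row, int col) {
    int n = row * col;
    RGBA* d_img;
    cudaMalloc((void**)&d_img, n * sizeof(RGBA));
    RGBA* h_img = (RGBA*)malloc(n * sizeof(RGBA));   // ordinary cached host memory
    fill_kernel<<<(n + 255) / 256, 256>>>(d_img, n);
    cudaMemcpy(h_img, d_img, n * sizeof(RGBA), cudaMemcpyDeviceToHost);
    // ... postprocess loop reads h_img ...
    free(h_img);
    cudaFree(d_img);
}

// Unified memory: one managed buffer, no explicit copy.
void unified_path(int row, int col) {
    int n = row * col;
    RGBA* um_img;
    cudaMallocManaged((void**)&um_img, n * sizeof(RGBA));
    fill_kernel<<<(n + 255) / 256, 256>>>(um_img, n);
    cudaDeviceSynchronize();   // the CPU must not touch um_img before this
    // ... the same postprocess loop reads um_img directly ...
    cudaFree(um_img);
}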
Running this demo on the TX2, I get the results below:
$ ./ExplicitMemory
kernel and data transfer time: 14.658 ms
postprocess time: 3.183 ms
and
$ ./UnifiedMemory
RGB kernel time: 3.484 ms float kernel time: 0.878 ms
RGB postprocess time: 9.669 ms float postprocess time: 7.063 ms float postprocess time without data copy: 3.301 ms
Whether the TX2 environment is CUDA 9.0 or CUDA 10.0, I get similar results.
It is strange that the same code gets such different performance. Although I found a way to reach the performance of explicit memory while still using unified memory, I would like to know whether there is anything wrong with the original unified memory access.
Hi, recently I found an interesting result with the above demo.
The results above were obtained in Max-N mode, where the 2 Denver2 and 4 ARM A57 cores are all enabled.
But when the mode was set to Max-P ARM, where only the 4 ARM A57 cores are used, the results changed to:
$ ./ExplicitMemory
kernel and data transfer time: 7.877 ms
postprocess time: 1.892 ms
and
RGB kernel time: 0.666 ms float kernel time: 0.582 ms
RGB postprocess time: 4.885 ms float postprocess time: 3.555 ms float postprocess time without data copy: 1.467 ms
All postprocessing is faster in Max-P ARM mode. So is the reason that memory access on the TX2 is slower from a Denver2 core than from an ARM A57 core?
Another question: in either mode, memory access to arrays of structures is slower. I understand that in some situations the reason is a lower cache hit ratio, but that still does not explain the different performance of the same postprocessing code with explicit memory versus unified memory.
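To illustrate the cache-hit point: with the array of structures, the loop only tests the .a field but still pulls the whole struct of every pixel through the cache, whereas a separate per-channel array keeps the scan dense. A rough sketch (img_a and result_indices are placeholder names, not from the demo):

// Array of structures: only .a is tested, but each cache line also
// carries the r/g/b bytes of the pixels in it.
for (int i = 0; i < row * col; i++) {
    if (img_RGB[i].a > 0) {
        result_RGB.push_back(img_RGB[i]);
    }
}

// Separate channel array: the scan touches a dense float array,
// so every byte fetched into the cache is used by the test.
for (int i = 0; i < row * col; i++) {
    if (img_a[i] > 0.0f) {
        result_indices.push_back(i);
    }
}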
Hi,
Sorry for the late update.
Here are some initial thoughts on this issue.
In general, memory access speed is: cudaMalloc > unified memory > pinned memory.
The overhead is: cudaMalloc (full memory copy) > unified memory (buffer synchronization) > pinned memory.
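For reference, the three allocation paths look like this (sketch; the buffer size is a placeholder):

#include <cuda_runtime.h>

void allocation_examples() {
    size_t bytes = 1920 * 1080 * sizeof(float);   // placeholder size

    // 1. cudaMalloc: device memory, fastest for GPU access, but CPU <-> GPU
    //    traffic needs explicit cudaMemcpy calls (the copy overhead above).
    float* d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // 2. Unified memory: one pointer usable by CPU and GPU; the driver
    //    synchronizes the buffer around kernel launches (the sync overhead above).
    float* um_buf;
    cudaMallocManaged((void**)&um_buf, bytes);

    // 3. Pinned host memory: usable by both sides with no copy or migration,
    //    which is why its overhead is lowest and its access speed is listed last.
    float* h_buf;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocMapped);
}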
We are going to reproduce this issue on our side.
May I know which JetPack version you use, v43 or v44?
Thanks.
Hi,
Thanks for the reply!
I tried this demo on 2 TX2s: one is definitely CUDA 10.0 on v43, and the other is CUDA 9.0, probably on v42 (that system was installed more than half a year ago). The results showed no difference.
What I wonder is: although the CPU and GPU physically share the same memory on the TX2, can I say that the explicit memory used by the CPU is faster than pinned memory, cudaMalloc memory, and unified memory, which can be used by both CPU and GPU?
If so, the questions for this demo would be: 1. Why is the access performance for arrays of structures in unified memory much worse than for floats? 2. Is it true that the memory access performance of a Denver2 core is slower than that of an ARM A57 core?
For the 2nd question, I also see similar performance differences in many other codes that run on the TX2 in different modes.
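One way to check the 2nd question is to pin the postprocessing loop to a single core and compare the timing on an A57 core against a Denver2 core. A minimal sketch (the Denver2/A57 core numbering is an assumption and should be verified with lscpu or /proc/cpuinfo on the board):

#include <sched.h>
#include <cstdio>

// Pin the calling thread to one CPU core before running the postprocess loop.
// On the TX2 the Denver2 cores are commonly reported as CPUs 1-2 and the
// A57 cores as 0 and 3-5, but the mapping should be verified per board.
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
    }
}

// Usage:
//   pin_to_core(0);   // A57 (assumed mapping)
//   ... time the postprocess loop ...
//   pin_to_core(1);   // Denver2 (assumed mapping, only online in Max-N mode)
//   ... time the postprocess loop again ...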
Hi,
For more memory information, please check our document here:
Memory from cudaMalloc can only be accessed by the GPU. That's why its overhead is the memory copy time.
Unified memory can be accessed by either the CPU or the GPU, but concurrent access is not supported.
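You can confirm this on your device with the concurrentManagedAccess attribute; a small sketch (device 0 is assumed, and on the TX2 the attribute is expected to be 0):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);
    // When this is 0, the CPU must wait (e.g. cudaDeviceSynchronize())
    // before touching a managed buffer that a kernel has accessed.
    return 0;
}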
1. cudaMalloc memory is faster than unified memory.
Since it is only accessible by the GPU, the memory can be allocated somewhere close to the GPU.
However, the sample doesn't copy the memory from the CPU to the GPU and copy the result back, so that overhead isn't taken into account (see the timing sketch after point 2).
2. We expect the memory access time to be similar.
You can give it a try and measure it directly.
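To include the transfer overhead from point 1, the explicit path can be timed around both copies, for example like this (sketch; buffer names, kernel, and launch configuration are placeholders):

// Time H2D copy + kernel + D2H copy together with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_img, h_in, bytes, cudaMemcpyHostToDevice);
process_kernel<<<blocks, threads>>>(d_img, row, col);
cudaMemcpy(h_out, d_img, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// ms then covers the kernel plus both transfers, comparable to the
// "kernel and data transfer time" printed by ./ExplicitMemory.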
Thanks.