CPU operation is very slow on memory allocated by cudaMallocHost

heyworld · October 6, 2018, 5:43am

Copying data between the GPU and CPU is faster when I allocate host memory (let’s call it hostMem) with cudaMallocHost rather than malloc.

However, CPU operations on hostMem are much slower. Is there a way to allocate memory that makes copying faster without slowing down CPU operations?

Thanks in advance.

AastaLLL · October 8, 2018, 6:50am

Hi,

Have you maximized the device performance first?

sudo ./jetson_clocks.sh

Thanks.

heyworld · October 8, 2018, 7:53am

Yes, I did.
I found in some other topics that pinned memory (allocated by cudaMallocHost) is not cached, which is why CPU operations on pinned memory are slow.

AastaLLL · October 15, 2018, 10:45am

Hi,

YES.

It’s recommended to use unified memory for Jetson.
You can check this document for the memory management on Jetson:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
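
As a rough illustration of the unified memory approach, here is a minimal sketch. The kernel `scale` and the helper `scaleOnGpu` are hypothetical names made up for this example, not part of any NVIDIA sample. It assumes a single managed allocation that the CPU fills, the GPU modifies, and the CPU reads back only after synchronizing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Fills a managed buffer with 1.0f on the CPU, doubles it on the GPU,
// and returns the first element (or -1.0f if a CUDA call fails).
float scaleOnGpu(int n) {
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0f;

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n);     // GPU uses the same pointer

    // Wait for the GPU before the CPU touches the buffer again.
    float result = -1.0f;
    if (cudaDeviceSynchronize() == cudaSuccess) result = data[0];

    cudaFree(data);
    return result;
}

int main() {
    std::printf("data[0] after kernel: %.1f\n", scaleOnGpu(1 << 20));
    return 0;
}
```

There is no cudaMemcpy in either direction; the synchronization call is what separates the GPU's use of the buffer from the CPU's.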

Thanks.

heyworld · October 15, 2018, 6:18pm

I am working on unified memory, but someone said it’s not supported on TX2, while some documents say it is. Do you know definitively whether it is supported?

Honey_Patouceul · October 15, 2018, 6:55pm

It is supported.

heyworld · October 15, 2018, 9:52pm

Thanks. I have implemented unified memory, but I encountered a Bus error (core dumped). I read some articles that mention exclusive access by the GPU and the CPU.

My program is multi-threaded, and the GPU and CPU need to access different parts of the unified memory. Is this the reason I get the Bus error (core dumped)? If yes, is there an alternative that allows me to use unified memory together with multi-threading?

Honey_Patouceul · October 15, 2018, 10:26pm

Unified memory means that you can access it from the CPU or the GPU without copying, but accessing it from both at the same time is usually a bad idea. What would you expect the result to be if both sides wrote at the same time?
You may rather allocate several buffers with unified memory.
For example, you may receive data from the camera on the CPU side and store it in buffer1, while the GPU is processing buffer2 and buffer3 is displayed (from the GPU or CPU side).
Once each stage has finished its frame, you would then receive data into buffer3, the GPU would process buffer1 (which holds the previously acquired frame), and the display would read buffer2 (previously processed by the GPU), and so on. This is just a basic example; it may be more complex depending on your use case.
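
A sketch of that rotation is below. It is not from the original posters. It assumes the per-stream attachment mechanism described in the CUDA programming guide (`cudaMemAttachHost` and `cudaStreamAttachMemAsync`), which, as I understand it, is what lets the CPU touch one managed buffer while a kernel works on another on devices without concurrent managed access. The names `process`, `capture` and `runPipeline` are hypothetical, and the "display" stage is reduced to checking one byte.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 from the enclosing function if a CUDA call fails.
// (Sketch only: resources are not released on the error path.)
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(e_)); \
            return 1;                                             \
        }                                                         \
    } while (0)

// Hypothetical GPU stage: inverts every byte of the frame.
__global__ void process(unsigned char* frame, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] = 255 - frame[i];
}

// Hypothetical stand-in for a camera capture on the CPU side.
void capture(unsigned char* frame, int n, int frameNo) {
    for (int i = 0; i < n; ++i) frame[i] = (unsigned char)(frameNo + i);
}

// Runs the rotation for `frames` iterations; returns 0 on success.
int runPipeline(int frames) {
    const int n = 640 * 480;
    const int kBuffers = 3;
    unsigned char* buf[kBuffers] = {nullptr, nullptr, nullptr};
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // cudaMemAttachHost: each buffer starts out owned by the CPU.
    for (int b = 0; b < kBuffers; ++b)
        CHECK(cudaMallocManaged(&buf[b], n, cudaMemAttachHost));

    capture(buf[0], n, 0);  // prime the pipeline with frame 0

    for (int f = 1; f <= frames; ++f) {
        int gpuIdx = (f - 1) % kBuffers;  // frame captured last iteration
        int cpuIdx = f % kBuffers;        // buffer the CPU fills now

        // Hand one buffer to the GPU stream, then process it there.
        CHECK(cudaStreamAttachMemAsync(stream, buf[gpuIdx], 0, cudaMemAttachSingle));
        process<<<(n + 255) / 256, 256, 0, stream>>>(buf[gpuIdx], n);

        // Meanwhile the CPU writes a different, host-attached buffer.
        capture(buf[cpuIdx], n, f);

        // Wait for the GPU, then give the processed buffer back to the CPU.
        CHECK(cudaStreamSynchronize(stream));
        CHECK(cudaStreamAttachMemAsync(stream, buf[gpuIdx], 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));

        // "Display" stage: frame f-1 began with byte value f-1, now inverted.
        unsigned char expected = (unsigned char)(255 - (f - 1));
        if (buf[gpuIdx][0] != expected) return 2;
    }

    for (int b = 0; b < kBuffers; ++b) cudaFree(buf[b]);
    cudaStreamDestroy(stream);
    return 0;
}

int main() {
    int rc = runPipeline(5);
    std::printf("pipeline %s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

The point of the design is that a buffer is attached either to the host or to the GPU stream at any moment, never both, so the CPU and the GPU never work on the same buffer at the same time.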

heyworld · October 15, 2018, 11:02pm

Thanks Honey_Patouceul.

My current situation is that I do need to access unified memory from the GPU and the CPU at the same time. I logically partition the unified memory into 10 parts, and the GPU and CPU access it at the same time, but in different parts.
Your suggestion, “You may rather allocate several buffers with unified memory”, sounds workable, but how can I allocate several buffers with unified memory? For example:

for (int i = 0; i < 10; i++) {
    cudaMallocManaged(&unified_buffer[i], memSize);
}

From my understanding, even though there are 10 starting pointers in the code above, they are all regarded as one unified memory. For example, while the GPU is accessing unified_buffer[2], can the CPU access unified_buffer[1]?
(The code above is my current implementation, and I did get the Bus error (core dumped).)

Could you shed some light on how I can use several buffers with unified memory, so that the GPU can work on unified_buffer[2] while the CPU works on unified_buffer[1]?

Thank you very much.

heyworld · October 16, 2018, 10:00pm

In my case, the CPU and GPU both need to access unified memory, which is not supported by the TX2 hardware. Do you have other methods that could help?

AastaLLL · October 19, 2018, 5:41am

Hi, heyworld

Both CPU and GPU can access unified memory.
You can find some information in this document:

Could you share more details about your use case, so that we can offer further suggestions?

Thanks.

heyworld · October 19, 2018, 5:52am

Hi AastaLLL,

Sorry, what I mean is that in my case the CPU and GPU need to access unified memory at the same time. My program is multi-threaded, and the CPU and GPU access the unified memory simultaneously, but at different addresses.

AastaLLL · October 26, 2018, 6:55am

Hi,

Concurrent access is not supported on TX2 but available on Xavier:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-coherency-hd

Simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. However, devices of compute capability 6.x on supporting operating systems allow the CPUs and GPUs to access Unified Memory allocations simultaneously via the new page faulting mechanism. A program can query whether a device supports concurrent access to managed memory by checking a new concurrentManagedAccess property. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.
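
The property mentioned in the quote can be checked at runtime. A minimal sketch, assuming the attribute query in the CUDA runtime API (the helper name `concurrentManagedAccess` is made up for this example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if the device reports concurrent managed access, 0 if not,
// and -1 if the query itself fails.
int concurrentManagedAccess(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    return err == cudaSuccess ? value : -1;
}

int main() {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        std::fprintf(stderr, "no CUDA device available\n");
        return 1;
    }
    int supported = concurrentManagedAccess(device);
    if (supported < 0) {
        std::fprintf(stderr, "attribute query failed\n");
        return 1;
    }
    std::printf("device %d concurrentManagedAccess = %d\n", device, supported);
    return 0;
}
```

A value of 0 means the CPU must not touch managed memory that the GPU may be using while a kernel is active, which matches the Bus error described earlier in this topic.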

Thanks.
