Hi!
I am fairly new to programming with CUDA and have a question. We are currently developing an API that is used in an application where we stream images from multiple devices.
The setup is that we create a cudaMemoryRessource (similar to std::pmr) which handles the memory allocation for all devices via cudaMallocManaged calls. My understanding was that with cudaMallocManaged the memory should be accessible from both host and CUDA code, and when we run the application in a single thread this is indeed the case. However, when we try to grab images simultaneously from two separate threads, we run into problems accessing those memory locations.
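To make the setup concrete, the memory resource is shaped roughly like this. This is a minimal, CPU-only sketch: the class name is illustrative, and std::malloc / std::free stand in for cudaMallocManaged / cudaFree so the snippet builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory_resource>

// Illustrative stand-in for our cudaMemoryRessource: a std::pmr-style
// resource whose real implementation calls cudaMallocManaged so that the
// returned pointers are valid in both host and device code.
class CudaManagedResource : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t /*alignment*/) override {
        // Real code: cudaMallocManaged(&p, bytes); checked for cudaSuccess.
        void* p = std::malloc(bytes);
        return p;
    }

    void do_deallocate(void* p, std::size_t /*bytes*/, std::size_t /*alignment*/) override {
        // Real code: cudaFree(p);
        std::free(p);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

All devices share one instance of this resource, and the blocks it hands out are touched from host code and from kernels.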
An example: we have a CUDA class that does image manipulation, and for that we allocate a memory block for a calibration struct using the cudaMemoryRessource:
m_pCalibration =
std::make_unique<memory::TypedMemoryBlock<utils::Calibration>>(pCudaMr, calibration);
This CUDA class has getter and setter methods because the calibration can change; these are mostly called from non-CUDA (host) code. In a single-threaded application that uses the devices sequentially, this is not a problem. However, running the devices in separate threads leads to segmentation faults, for example when accessing the calibration or even when copying the generated images.
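For context, the access pattern looks roughly like this (a CPU-only sketch; all names are illustrative, and the managed TypedMemoryBlock is replaced by a plain member so it compiles without CUDA):

```cpp
#include <cassert>
#include <thread>

// Illustrative stand-in for utils::Calibration.
struct Calibration {
    float fx = 0.f;
    float fy = 0.f;
};

// Stripped-down version of the image-manipulation class: in the real code
// m_calibration lives in a memory::TypedMemoryBlock allocated through the
// cudaMemoryRessource, and kernels read it while host threads call the
// getter/setter below.
class ImageProcessor {
public:
    Calibration getCalibration() const { return m_calibration; }
    void setCalibration(const Calibration& c) { m_calibration = c; }

private:
    Calibration m_calibration;
};
```

In the application, each device has its own thread that calls these accessors while images are being generated, and that is where the crashes appear.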
Is there anything we should watch out for when using cudaMallocManaged from different threads?