Hi!
I am fairly new to programming with CUDA and have a question. We are currently developing an API that is used in an application where we stream images from multiple devices.
The setup is that we create a cudaMemoryRessource (similar to std::pmr) which handles the memory allocation for all devices via cudaMallocManaged calls. My understanding was that with cudaMallocManaged the memory should be accessible from both host and CUDA code, and when we run the application in a single thread this is indeed the case. However, when we try to grab images simultaneously from two separate threads, we run into problems accessing those memory locations.
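To make the setup concrete, the memory resource is shaped roughly like this. This is a minimal, CPU-only sketch: the class name is illustrative, and std::malloc / std::free stand in for cudaMallocManaged / cudaFree so the snippet builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory_resource>

// Illustrative stand-in for our cudaMemoryRessource: a std::pmr-style
// resource whose real implementation calls cudaMallocManaged so that the
// returned pointers are valid in both host and device code.
class CudaManagedResource : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t /*alignment*/) override {
        // Real code: cudaMallocManaged(&p, bytes); checked for cudaSuccess.
        void* p = std::malloc(bytes);
        return p;
    }

    void do_deallocate(void* p, std::size_t /*bytes*/, std::size_t /*alignment*/) override {
        // Real code: cudaFree(p);
        std::free(p);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

All devices share one instance of this resource, and the blocks it hands out are touched from host code and from kernels.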
An example: we have a CUDA class that does image manipulation, and for that we allocate a memory block for a calibration struct using the cudaMemoryRessource:
m_pCalibration =
std::make_unique<memory::TypedMemoryBlock<utils::Calibration>>(pCudaMr, calibration);
This CUDA class has getter and setter methods because the calibration can change; these are mostly called from non-CUDA (host) code. In a single-threaded application that uses the devices sequentially, this is not a problem. However, running the devices in separate threads leads to segmentation faults, for example when accessing the calibration or even when copying the generated images.
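For context, the access pattern looks roughly like this (a CPU-only sketch; all names are illustrative, and the managed TypedMemoryBlock is replaced by a plain member so it compiles without CUDA):

```cpp
#include <cassert>
#include <thread>

// Illustrative stand-in for utils::Calibration.
struct Calibration {
    float fx = 0.f;
    float fy = 0.f;
};

// Stripped-down version of the image-manipulation class: in the real code
// m_calibration lives in a memory::TypedMemoryBlock allocated through the
// cudaMemoryRessource, and kernels read it while host threads call the
// getter/setter below.
class ImageProcessor {
public:
    Calibration getCalibration() const { return m_calibration; }
    void setCalibration(const Calibration& c) { m_calibration = c; }

private:
    Calibration m_calibration;
};
```

In the application, each device has its own thread that calls these accessors while images are being generated, and that is where the crashes appear.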
Is there anything we should watch out for when using cudaMallocManaged from different threads?