cudaHostRegister on multiple threads

I experienced an issue when using cudaHostRegister from two host threads. I have two host threads: in one I’m using device 0 (cudaSetDevice(0)), while in the other I’m using device 1 (cudaSetDevice(1)). In both threads I’m continuously allocating memory with malloc and then pinning it with cudaHostRegister. After some iterations I get a crash on the cudaHostRegister call. When using cudaHostAlloc I don’t experience any issue. Could it be that cudaHostRegister is not thread-safe? Could someone give me more insight?

Probably an example would help. Are you doing rigorous CUDA error checking? That is, checking the return value of every CUDA API call, and reporting or logging an error if it is not cudaSuccess? If not, start there. That output may be useful.
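For example, something along these lines (just a sketch; the macro name is arbitrary, and it assumes &lt;iostream&gt; and the CUDA runtime header are available):

// minimal error-checking sketch: report which call failed and why
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::cerr << #call << " failed: "                         \
                      << cudaGetErrorString(err_) << std::endl;       \
        }                                                             \
    } while (0)

// usage:
CHECK_CUDA(cudaHostRegister(a, size, 0));
CHECK_CUDA(cudaHostUnregister(a));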

If you are doing:

T *a;
a = (T *)malloc(size);
cudaHostRegister(a, ...);
...
free(a);

i.e. calling free() without first calling cudaHostUnregister(), then that may be the issue.

A corresponding mistake isn’t really possible if you are doing:

cudaHostAlloc(&a, ...);
...
cudaFreeHost(a);

Note the cudaHostRegister documentation:

The memory page-locked by this function must be unregistered with cudaHostUnregister().
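In other words, pair the calls like this (a minimal sketch, reusing T and size from above):

T *a = (T *)malloc(size);
cudaHostRegister(a, size, 0);   // pin the existing allocation
...
cudaHostUnregister(a);          // unpin first
free(a);                        // then release the allocation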

Other than that, I would need an example to offer anything further. The CUDA runtime API is thread safe except for noted exceptions around graph usage.

I’m calling cudaHostUnregister before freeing the memory. Here’s an example of what we are doing:

cudaError_t err = cudaSuccess;

if (m_pBuf)
{
    err = cudaHostUnregister(m_pBuf);
    free(m_pBuf);
}

m_pBuf = malloc(size);
err = cudaHostRegister(m_pBuf, size, 0);

At the moment I’m not using cudaDeviceSynchronize after each call, but I can modify the code to do rigorous CUDA error checking. However, from all my tests it seems that the underlying behavior of cudaHostRegister and cudaHostAlloc differs in terms of thread-safety.

What I did to study the differences was to put this code snippet into two different host threads and after a while it crashes. This does not happen when using cudaHostAlloc and cudaFree instead.

EDIT2: I tried to do CUDA error checking with device synchronize, but I’m unable to see anything useful because of the crash itself. The cudaHostRegister call that would return a status != cudaSuccess crashes the application before the error can be reported.

The threads would each use a different m_pBuf pointer?

Perhaps you can provide a minimal viable example of the crashing code, i.e. this code snippet running in multiple threads?

Also compare to this related thread, where repeated calls of cudaHostRegister and cudaHostUnregister led to errors that probably should not have occurred.

Sure, the two buffer pointers are different. Below is the code snippet that I’m using to reproduce the issue:

int* buf1; int* buf2;
void function1()
{
    while (true)
    {
       cudaSetDevice(0);
       cudaError_t err = cudaSuccess;

       if (buf1)
       {
           err = cudaHostUnregister(buf1);
           free(buf1);
       }

       buf1 = malloc(size);
       err = cudaHostRegister(buf1, size, 0);
    }
}

void function2()
{
    while (true)
    {
       cudaSetDevice(1);
       cudaError_t err = cudaSuccess;

       if (buf2)
       {
           err = cudaHostUnregister(buf2);
           free(buf2);
       }

       buf2 = malloc(size);
       err = cudaHostRegister(buf2, size, 0);
    }
}

int main()
{
    std::thread thread1(function1);
    std::thread thread2(function2);

    while (true)
    {
        Sleep(10);
    }

    return 0;
}

I would say the problem may be related to the use of two different GPUs in the two threads, because I don’t have any issue when using the same device. Actually, I just noticed that I forgot to mention this detail; I will update the code snippets above with cudaSetDevice.

Thank you for sharing. Some quick feedback before trying to run the code:

Generally it looks good. I would initialize the buf1 and buf2 pointers to zero to avoid calling unregister and free on a random address, potentially corrupting data structures.
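For example (just illustrating the suggestion, reusing the globals from your snippet):

int* buf1 = nullptr;   // zero-initialized, so the first iteration skips unregister/free
int* buf2 = nullptr;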

Strictly speaking, some of the loops could be UB if they are infinite and have no observable behaviour, but I do not believe that is the reason for a crash here.

Yeah, the code is just a quick example; please don’t take it as final. When running the same code with cudaHostAlloc and cudaFree, the crash never occurs.

Does the crash occur with just doing malloc once, but repeatedly registering and unregistering?

Would it help to use a mutex around the malloc/free calls or together with the register calls? (To identify the call that is not thread-safe).

Normally malloc/free in the standard library should be thread safe.
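For instance, a variant of your function1 along these lines (only a sketch, reusing your size and assuming the usual includes; reg_mutex is a made-up name for a shared std::mutex and is only needed for the serialization experiment):

std::mutex reg_mutex;   // shared between both threads, for the serialization experiment only

void function1()
{
    cudaSetDevice(0);
    int* buf = (int*)malloc(size);   // allocate once, outside the loop
    while (true)
    {
        // std::lock_guard<std::mutex> lock(reg_mutex);   // uncomment to serialize the register/unregister calls
        cudaError_t err = cudaHostRegister(buf, size, 0);
        if (err != cudaSuccess) { std::cout << "register: " << cudaGetErrorString(err) << std::endl; break; }
        err = cudaHostUnregister(buf);
        if (err != cudaSuccess) { std::cout << "unregister: " << cudaGetErrorString(err) << std::endl; break; }
    }
}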

I just tried doing malloc only once for the two buffers, never calling free on them, and indeed it does not crash anymore. This sounds really strange to me, because malloc should be thread-safe, and when it crashes it seems to be caused by cudaHostRegister.

EDIT: The same crash happens when using new and delete instead of malloc/free

It could still be a bug in cudaHostRegister and the kernel functions it calls, because cudaHostRegister possibly gets a different address from each new malloc call than it did from the first one. You could print out the addresses and show that it does not crash if the address stays the same, but crashes when the address changes.

Put this together with 4K and/or 4M memory pages. Perhaps cudaHostRegister has problems with neighbouring memory regions. You could try the approach from the linked forum page, where the working solution/workaround was to call cudaHostRegister in a loop over the buffer, registering only a 4K block with each call.
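Roughly like this (an untested sketch; it prints the address so you can see whether it changes, and registers in 4 KB pieces as in that workaround):

const size_t page = 4096;
char* p = (char*)malloc(size);
printf("allocated %p\n", (void*)p);   // does the address change between iterations?
for (size_t off = 0; off < (size_t)size; off += page)
{
    size_t len = ((size_t)size - off < page) ? ((size_t)size - off) : page;
    cudaError_t err = cudaHostRegister(p + off, len, 0);
    if (err != cudaSuccess)
        printf("register failed at offset %zu: %s\n", off, cudaGetErrorString(err));
}
// later: cudaHostUnregister(p + off) in the same 4 KB steps, then free(p)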

There are a number of things about your posted code that are undefined or incorrect. Also there is no proper error checking that I can see.

When I run this modified version on CUDA 12.2 on a DGX H100 on Linux, I get no printouts or crashes of any kind after running for 10 minutes:

#include <thread>
#include <vector>
#include <iostream>

const int size = 256*1048576;
int* buf1= NULL; int* buf2 = NULL;
void function1()
{
    bool notfinished = true;
    while (notfinished == true)
    {
       cudaSetDevice(0);
       cudaError_t err = cudaSuccess;

       if (buf1)
       {
           err = cudaHostUnregister(buf1);
           if (err != cudaSuccess) {std::cout << "oops3: " << cudaGetErrorString(err) << std::endl; notfinished = false;}
           free(buf1);
       }

       buf1 = (int *)malloc(size);
       if (buf1 == NULL) {std::cout << "oops1" << std::endl; notfinished = false;}
       err = cudaHostRegister(buf1, size, 0);
       if (err != cudaSuccess) {std::cout << "oops4: " << cudaGetErrorString(err) << std::endl; notfinished = false;}
    }
}

void function2()
{
    bool notfinished = true;
    while (notfinished == true)
    {
       cudaSetDevice(1);
       cudaError_t err = cudaSuccess;

       if (buf2)
       {
           err = cudaHostUnregister(buf2);
           if (err != cudaSuccess) {std::cout << "oops5: " << cudaGetErrorString(err) << std::endl; notfinished = false;}
           free(buf2);
       }

       buf2 = (int *)malloc(size);
       if (buf2 == NULL) {std::cout << "oops2" << std::endl; notfinished = false;}
       err = cudaHostRegister(buf2, size, 0);
       if (err != cudaSuccess) {std::cout << "oops6: " << cudaGetErrorString(err) << std::endl; notfinished = false;}
    }
}

int main()
{
    std::thread thread1(function1);
    std::thread thread2(function2);

    while (true)
    {
        //Sleep(10);
    }

    return 0;
}

compiled with nvcc -o t213 t213.cu

I wouldn’t be able to offer anything further without an exact, actual and complete test case.

Thank you Robert! I just tried your code on the same Windows machine with two GPUs and indeed it does not crash. But I also tried the same with size = 1000 and it crashes after a few seconds. So it seems to be related to the amount of memory allocated (perhaps when it is not a multiple of 16 bytes or something similar).
I also provide a snippet of the call stack when it crashes with your code and size = 1000.

EDIT1: Actually, I tried with size = 4 * 1024 (4 KB) and it crashes. With 1 MB and 4 MB it works. I really don’t understand why.

I tried on Linux with size = 1000; it seems to run fine for 3-4 minutes. I didn’t wait any longer than that.

Perhaps there is something specific to windows here, I don’t know.

What CUDA version are you using?

Yesterday I tried on different machines with different GPUs, and it seems related to the driver. With two RTX A5000s I have the same issue with both CUDA 11.1 and CUDA 12.3, while with RTX 3090s I never encounter any issues, even when changing the size. I know the CUDA runtime API uses cu (driver) primitives underneath, which I imagine differ from driver to driver, and in my case the two drivers are quite different: the RTX A5000s are running 546.12 while the 3090s are running 461.4.

As a test of your theory, you could update the 3090 machine to the 546.12 driver to see if the problems occur there as well.

It’s probably also worth a test to update the RTX A5000 machine to the latest available driver to see if the symptoms there persist.