cudaMallocHost and pthreads issues with accessing memory from different threads


I am trying to take advantage of multiple graphics cards, though my tests are currently running with just one. The outline of the program goes:


Allocate memory on the host, and load in the raw data.

Spawn a thread for each card; each thread will:

      Initialise the card, and allocate memory on it.

      forall tiles {

         copy a tile across onto the card from the host using cudaMemcpy3DAsync

         run analysis on this tile

         copy results back

      }

End threads.

Compare results with those generated from the gold function.


As I am using streams, it is necessary to use cudaMallocHost to allocate memory for the raw data and the results on the host. However, it appears that memory allocated this way is only visible to the thread it was allocated in, even though the pointer to it is a global variable. The code all runs fine in emulator mode. I have experimented with moving the allocation into different parts of the code, and any function using the memory from a different thread fails, either with an argument error in the case of cudaMemcpy3DAsync, or with a segmentation fault if it is a direct access from the code (the test against the gold results, if I allocate the memory inside the spawned thread).

Is this the expected behaviour? Does cudaMallocHost only allow the memory to be used by the thread that called it?

If so, is there a workaround? I realise I could allocate a single pinned tile and then copy the data into it from memory allocated with the standard malloc, but this seems very inefficient.
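The single-tile workaround described above could look roughly like the following. This is only a sketch under assumptions: upload_tile_via_staging, TILE_ELEMS and the flat float layout are made up for illustration, and a plain cudaMemcpyAsync stands in for cudaMemcpy3DAsync to keep it short. The staging buffer is assumed to come from cudaMallocHost called in the same thread that issues the copy.

```cuda
#include <cuda_runtime.h>
#include <string.h>

#define TILE_ELEMS 4096  /* hypothetical tile size, in floats */

/* Copies one tile from pageable memory (plain malloc) to the device through a
   pinned staging buffer. h_staging must have been allocated with
   cudaMallocHost by the calling thread and hold at least TILE_ELEMS floats. */
static cudaError_t upload_tile_via_staging(float *d_tile,
                                           float *h_staging,
                                           const float *h_pageable,
                                           size_t tile_index,
                                           cudaStream_t stream)
{
    const size_t tile_bytes = TILE_ELEMS * sizeof(float);

    /* Extra host-side copy: pageable source -> pinned staging buffer. */
    memcpy(h_staging, h_pageable + tile_index * TILE_ELEMS, tile_bytes);

    /* Asynchronous host-to-device copy out of the pinned buffer. */
    cudaError_t err = cudaMemcpyAsync(d_tile, h_staging, tile_bytes,
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) return err;

    /* Wait, so the staging buffer can safely be refilled with the next tile. */
    return cudaStreamSynchronize(stream);
}
```

The memcpy is the extra cost in question. Because the function waits before returning, nothing overlaps with the transfer here; alternating between two staging buffers would be one way to win some of that back.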

If it is not the expected behaviour, does anyone have any ideas what I am doing wrong?

Many thanks in advance


In my experience, memory allocated in one thread using cudaMallocHost IS accessible by other threads. However, the memory is only ‘pinned’ in the thread it is allocated in. In the other threads, it is seen as unpinned memory. There is definitely an issue with accessing GPU memory in thread B if it was allocated in thread A.

I suspect the difference between my experiences and what you are encountering may be the fact that you are using streams. I was not. However, glancing at the documentation, I can’t seem to find anything that supports this.

This thread doesn’t address anything with streams, but it does go over the cudaMallocHost() and pthreads issues. I thought you may find it helpful.

This is the behavior I get too.

Tim Murray has mentioned that pinned memory for all GPUs is coming in a future CUDA release (that is, not the 2.1 release due in beta form any day now).

If you need async transfers in all of your threads in the meantime, you will need to allocate separate cudaMallocHost areas in each thread.
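A rough sketch of that arrangement with pthreads follows. Everything named here (worker, WorkerArg, run_on_all_cards, TILE_ELEMS, TILES_PER_GPU) is invented for illustration, the kernel launch is left as a comment, and a flat cudaMemcpyAsync stands in for cudaMemcpy3DAsync. The only point is that cudaSetDevice, cudaMallocHost and the async copies all happen in the same thread.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TILE_ELEMS    4096  /* hypothetical tile size, in floats */
#define TILES_PER_GPU 4     /* hypothetical number of tiles per card */

typedef struct {
    int device;  /* card this thread drives */
    int ok;      /* set to 1 if every CUDA call succeeded */
} WorkerArg;

static void *worker(void *p)
{
    WorkerArg *arg = (WorkerArg *)p;
    const size_t tile_bytes = TILE_ELEMS * sizeof(float);
    const size_t area_bytes = TILES_PER_GPU * tile_bytes;
    float *h_in = NULL, *h_out = NULL, *d_tile = NULL;
    cudaStream_t stream;

    arg->ok = 0;
    cudaError_t err = cudaSetDevice(arg->device);
    /* Pinned areas are allocated here, inside the thread that uses them. */
    if (err == cudaSuccess) err = cudaMallocHost((void **)&h_in, area_bytes);
    if (err == cudaSuccess) err = cudaMallocHost((void **)&h_out, area_bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_tile, tile_bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) {
        /* Cleanup after a partial failure is omitted to keep the sketch short. */
        fprintf(stderr, "device %d: %s\n", arg->device, cudaGetErrorString(err));
        return NULL;
    }

    /* Stand-in for loading this card's share of the raw data. */
    for (size_t i = 0; i < (size_t)TILES_PER_GPU * TILE_ELEMS; ++i)
        h_in[i] = (float)i;

    for (int t = 0; t < TILES_PER_GPU && err == cudaSuccess; ++t) {
        float *src = h_in + (size_t)t * TILE_ELEMS;
        float *dst = h_out + (size_t)t * TILE_ELEMS;
        err = cudaMemcpyAsync(d_tile, src, tile_bytes,
                              cudaMemcpyHostToDevice, stream);
        /* The analysis kernel would be launched on `stream` here. */
        if (err == cudaSuccess)
            err = cudaMemcpyAsync(dst, d_tile, tile_bytes,
                                  cudaMemcpyDeviceToHost, stream);
    }
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    /* With no kernel in between, the results are just the input echoed back. */
    arg->ok = (err == cudaSuccess) && memcmp(h_in, h_out, area_bytes) == 0;

    cudaStreamDestroy(stream);
    cudaFree(d_tile);
    cudaFreeHost(h_out);
    cudaFreeHost(h_in);
    return NULL;
}

/* Spawns one worker per card. Returns the number of workers that finished
   successfully, or -1 if no card could be enumerated. */
static int run_on_all_cards(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 1) return -1;

    pthread_t *threads = (pthread_t *)malloc(n * sizeof(pthread_t));
    WorkerArg *args = (WorkerArg *)malloc(n * sizeof(WorkerArg));
    if (!threads || !args) { free(threads); free(args); return -1; }

    int started = 0, ok = 0;
    for (; started < n; ++started) {
        args[started].device = started;
        args[started].ok = 0;
        if (pthread_create(&threads[started], NULL, worker, &args[started]) != 0)
            break;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], NULL);
        ok += args[i].ok;
    }
    free(threads);
    free(args);
    return ok;
}
```

Calling run_on_all_cards() from main would start one worker per card. If the gold comparison runs in the main thread, it is probably safest for each worker to copy its results into ordinary malloc memory before freeing its pinned area.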