cudaHostAlloc and thread safety problems with pinned, portable memory

I am trying to work with pinned, portable memory. (CUDA runtime 2.3)

The example I am working from allocates pinned memory (with cudaHostAlloc((void**)&addr, n_bytes, cudaHostAllocPortable))
in the main thread, then launches separate host threads for each GPU.

That is OK. But I get problems when I try to allocate pinned/portable memory within each host thread. So:

A) is it OK to allocate/free pinned memory within several different host threads? (And what if each had a different CUDA context?)
B) is it OK for one thread to allocate, and another thread to free a chunk of pinned/portable memory?

B) may seem like a strange thing to do. Basically I have a class that caches arrays, up to a global limit on the cached data.
Cached data is freed up when the limit is reached. Cached data may be used by different threads, and any thread may do the freeing.
(Users of the data must always check to see if it has been freed up and if so recalculate the array. But if it is still there,
time is saved). Anyway, this is all made thread safe with mutexes, and works fine as long as I use “new” and “delete”.
It passes its unit tests.
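For concreteness, the cache described above might look roughly like this. This is a hypothetical sketch, not the actual class: the name ArrayCache, the int keys, and the lowest-key-first eviction are all my own stand-ins. It uses new/delete, i.e. the version that passes the unit tests; the comments mark where cudaHostAlloc/cudaFreeHost were swapped in.

```cpp
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical sketch of the cache: arrays keyed by an id, entries evicted
// (here: lowest key first, as a stand-in for a real policy) once a global
// byte limit is exceeded. One mutex guards all state, so any thread may
// allocate or free any entry.
class ArrayCache {
public:
    explicit ArrayCache(std::size_t limitBytes) : limit_(limitBytes), used_(0) {}
    ~ArrayCache() {
        for (std::map<int, Entry>::iterator it = entries_.begin();
             it != entries_.end(); ++it)
            delete[] it->second.data;          // was: cudaFreeHost(...)
    }
    // Returns the cached array, or nullptr if it was evicted
    // (in which case the caller recomputes it).
    float* get(int id) {
        std::lock_guard<std::mutex> lock(m_);
        std::map<int, Entry>::iterator it = entries_.find(id);
        return it == entries_.end() ? nullptr : it->second.data;
    }
    // Allocates and caches a new array of n floats (assumes id is not
    // already cached), evicting older entries until it fits.
    float* put(int id, std::size_t n) {
        std::lock_guard<std::mutex> lock(m_);
        const std::size_t bytes = n * sizeof(float);
        while (used_ + bytes > limit_ && !entries_.empty())
            evictOne();
        float* p = new float[n];               // was: cudaHostAlloc(...,
                                               //   cudaHostAllocPortable)
        Entry e = { p, bytes };
        entries_[id] = e;
        used_ += bytes;
        return p;
    }
private:
    struct Entry { float* data; std::size_t bytes; };
    void evictOne() {                          // called with m_ already held
        std::map<int, Entry>::iterator it = entries_.begin();
        used_ -= it->second.bytes;
        delete[] it->second.data;              // was: cudaFreeHost(...)
        entries_.erase(it);
    }
    std::mutex m_;
    std::size_t limit_, used_;
    std::map<int, Entry> entries_;
};
```

With new/delete, two threads can hammer get/put concurrently and the mutex keeps it correct; the crash only appears once the marked lines call the CUDA allocator instead.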

Then I thought, why waste time and space copying that data into pinned memory before copying to the device? Why not
put it in pinned memory to start with? So I replaced “new” with “cudaHostAlloc(… cudaHostAllocPortable)” and “delete”
with “cudaFreeHost()”.

And now it passes all the unit tests apart from the one that exercises thread safety, where it crashes (seg faults) randomly.

NB in this unit test I am not even copying to the device - just repeatedly allocating, writing, reading and freeing, in
two different threads. (I call cudaSetDevice(0) before launching threads so in fact the CUDA context should be the same).

After scratching my head a bit, I wonder if perhaps cudaHostAlloc() is not quite thread safe to the extent I need?
Though, since I am (trying to) provide thread safety with mutexes, unless CUDA is explicitly using the thread id in some
way, it should be OK.

All I can find out for sure is that pinned/portable memory can be accessed by multiple devices which are controlled
from different host threads. Whether it is safe to allocate/free it in different host threads, I am not sure.

Does anyone know?

I have no idea what the allowed behavior was for this in CUDA 2.3, but I could imagine the pinned memory allocation information getting attached to the CUDA context associated with the allocating thread.

This is almost certainly fixed in CUDA 4.0. The CUDA runtime implementation (and semantics) have been totally restructured for thread safety. If you have a registered developer account, I would take a look at the CUDA 4.0 rc1.

It seems to be OK to allocate and deallocate pinned memory in different threads, as long as a given chunk is allocated and freed in the same thread. So you may well be right about pinned memory being tied to the CUDA context. Also, allocating pinned memory in lots of small chunks turns out to be very slow, so this was not a good solution anyway. Better to allocate the cached arrays with malloc/free (or new/delete) and copy everything into one big chunk of pinned memory later.
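The big-chunk approach might look something like this sketch. Everything here is illustrative: Chunk, uploadChunks, and devPtr are names I invented, and a real version would size or reuse the staging buffer rather than allocate it per call.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical descriptor for one cached (malloc'd) array.
struct Chunk { const char* hostPtr; size_t bytes; };

// Pack the cached arrays into one large pinned buffer, then do a single
// host-to-device copy: one cudaHostAlloc per transfer instead of one per
// cached array.
void uploadChunks(const std::vector<Chunk>& chunks, void* devPtr)
{
    size_t total = 0;
    for (size_t i = 0; i < chunks.size(); ++i)
        total += chunks[i].bytes;

    char* staging = 0;      // allocate and free in the SAME thread
    cudaHostAlloc((void**)&staging, total, cudaHostAllocPortable);

    size_t offset = 0;      // gather the cached arrays into the pinned buffer
    for (size_t i = 0; i < chunks.size(); ++i) {
        std::memcpy(staging + offset, chunks[i].hostPtr, chunks[i].bytes);
        offset += chunks[i].bytes;
    }

    cudaMemcpy(devPtr, staging, total, cudaMemcpyHostToDevice);
    cudaFreeHost(staging);
}
```

This also keeps the cudaHostAlloc/cudaFreeHost pair in one thread, which sidesteps the cross-thread free problem from question B entirely.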

Not tried it yet - next iteration maybe. Thanks.