Is it valid to concurrently read and write to disjoint segments of a single buffer allocated via cudaMallocHost?

To provide more context, the intended use case is to build a fast, simple allocator on top of a single large allocation created with cudaMallocHost.

“Allocations” created with this allocator will be 1024-byte-aligned segments of the original large allocation. The plan is to use these “allocations” just as we would use an allocation created via cudaMallocHost. So portions of the original large allocation may be read from/written to by the device while other sections are read from/written to by the host.

Is this valid? Will it induce any performance issues due to additional cache operations? This seems to be working for us (providing a solid performance improvement), but I want to make sure this is defined behavior.

Yes, it is valid.


It is too late in the day for my brain to reason about this properly, but do we not need some qualifier here regarding granularity? If not in terms of correctness, then at least in terms of performance? I am thinking about an analogue to false sharing of cache lines.

So, hypothetically something like “disjoint, defined as not jointly occupying any 64-byte aligned 64-byte chunk of memory”.


I agree: to avoid performance degradation, cache lines should not be shared between host and device accesses.
The OP is using 1024-byte-aligned chunks, so false sharing between CPU and GPU should not be an issue.


Thanks! Just out of curiosity, does this remain valid for unified memory (cudaMallocManaged)?