Say I have an allocation created via cudaMallocHost on a UVA platform. I believe it is a valid pattern to enqueue an operation that writes to this region and then, in the same stream, enqueue an operation that reads from that region, without any additional synchronization.
Assuming that is true, is it also valid to “fork” the stream after the writing operation and enqueue a reading operation on both the origin stream and the forked stream?
Consider something like the following pattern:
write_op, read_op0, read_op1; // read_op0 and read_op1 read a region written to by write_op
stream0, stream1;
event0, event1;
enqueue(write_op, stream0);
event_record(event0, stream0);
event_wait(stream1, event0); // "fork": read_op1 in stream1 will be ordered after write_op
enqueue(read_op0, stream0);
enqueue(read_op1, stream1);
event_record(event1, stream1);
event_wait(stream0, event1); // "join": later work in stream0 will be ordered after read_op1
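For concreteness, here is a minimal sketch of that pattern with the CUDA runtime API. The kernel names, buffer size, and launch configuration are placeholders I made up for illustration; the structure is what I am asking about.

cudaStream_t stream0, stream1;
cudaEvent_t event0, event1;
float *buf;
const size_t N = 1 << 20;                       // placeholder size
dim3 grid(4096), block(256);                    // placeholder launch configuration

cudaMallocHost(&buf, N * sizeof(float));        // pinned host allocation (UVA)
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
cudaEventCreateWithFlags(&event0, cudaEventDisableTiming);
cudaEventCreateWithFlags(&event1, cudaEventDisableTiming);

write_kernel<<<grid, block, 0, stream0>>>(buf, N);   // write_op
cudaEventRecord(event0, stream0);
cudaStreamWaitEvent(stream1, event0, 0);             // "fork": stream1 waits until write_op completes

read_kernel0<<<grid, block, 0, stream0>>>(buf, N);   // read_op0, ordered after write_op by stream0
read_kernel1<<<grid, block, 0, stream1>>>(buf, N);   // read_op1, ordered after write_op by the event wait

cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream0, event1, 0);             // "join": subsequent work in stream0 waits for read_op1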
I think this is a sufficiently isolated question, and perhaps one people may be more willing to help with, since it doesn’t require reading through all the details in the other post.
If splitting questions up in this manner goes against the expected etiquette of this forum, let me know and I’ll merge this into the original post.
I cannot think of potential issues with multiple CUDA streams belonging to the same process that all access the same memory allocation in a read-only fashion.
Do you have a minimal reproducer program implementing this scenario that shows some kind of unexpected behavior?
Note that (to the best of my knowledge) true concurrency does not really apply when transferring data across PCIe, as there can only be one transfer in a given direction at any particular time. Insofar as longer transfers could be broken up into smaller chunks, the transfers of chunks for different streams could intermingle.
Probably it does not go against any etiquette. Your question sounded a lot like your last finding in the other post.
My recommendation (I am just a user here) would be: if the question is independent enough, create a new thread. But you should give a bit more context in your questions. A title plus one sentence is very short, especially if the reader has to go back to the title to understand the single sentence of the question.
People do not want to read through every detail, but a post so short that readers have to guess what you mean is not good either.
Just some advice.
Thank you otherwise for the interesting questions.
All solid points, thanks for the response! And apologies for my overly defensive reply earlier. My initial aim was to see if there was perhaps some documentation on this I was missing. But, yes, I probably did not add enough context to make this truly independent from my other post. Will go ahead and add more now.
I’m not aware of any situation on a GPU where simultaneous reads, however they may come about, with no “nearby” writes, to any form of memory (global, shared, constant, etc.) could be “invalid”, somehow resulting in an error of any kind. That includes pinned memory (which is formally a type of global memory).
I would consider a “nearby” write to be one that is not separated by an appropriate barrier from the read activity.
In the case of global memory, writes in a previous kernel (or a previous cudaMemcpy() operation) are separated by a sufficient barrier from reads in a subsequent kernel. There are other possibilities as well, such as writes prior to a proper grid.sync() operation, followed by reads subsequent to it.
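A minimal sketch of the kernel-ordering case (the variable and kernel names are illustrative only): issuing the writing work and the reading work into the same stream is what supplies the barrier.

// "write" via an async copy (or a writing kernel), then read in a subsequent op in the same stream
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);  // write
read_kernel<<<grid, block, 0, stream>>>(d_data);                         // reads ordered after the copy completes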
In the case of constant memory, I would generally expect that the writes occur via cudaMemcpyToSymbol(), prior to the kernel launch where the reads occur. That is a sufficient “barrier”.
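A minimal sketch of the constant memory case (the symbol and kernel names are illustrative):

__constant__ float coeffs[16];

// host code: populate the __constant__ symbol before launching the kernel that reads it
float h_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));   // write occurs before the launch
use_coeffs_kernel<<<grid, block>>>(out);                   // kernel reads coeffs; the launch boundary is the "barrier"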
In the case of shared memory, I would expect that writes occur prior to the issuance of a __syncthreads() call, followed by the read activity. That would be a sufficient barrier.
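And a minimal sketch of the shared memory case (assuming a 1D block of exactly 256 threads; the kernel is illustrative):

__global__ void shared_read_after_write(const float *in, float *out)
{
    __shared__ float tile[256];
    unsigned int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gidx];                       // each thread writes one shared element
    __syncthreads();                                    // barrier: all writes now visible to the whole block
    out[gidx] = tile[(threadIdx.x + 1) % blockDim.x];   // safe to read an element written by another thread
}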
My usage of “barrier” here is really combining two ideas: synchronization and visibility. But the above examples suffice for both.
Are there other possibilities? Probably. It’s not possible for me to give a comprehensive or exhaustive answer to such a question.
I’m primarily focused on temporality of accesses to the same location here. If I have properly naturally-aligned activity originating from CUDA device threads to non-overlapping locations X and Y, writes to X should not impact reads from Y, AFAIK.