Consider the following scenario:
- There are two host threads, T1 and T2, and two CUDA streams, S1 and S2, each owned and used exclusively by the corresponding host thread.
- T2 is initially blocked on a semaphore and idle.
- T1 allocates managed memory and attaches it to S1 in single-stream mode (cudaStreamAttachMemAsync with cudaMemAttachSingle).
- T1 launches some kernels on S1 that populate the memory. Then T1 synchronizes on S1 to ensure the memory is populated.
- T1 signals the semaphore, unblocking T2.
- From then on, T1 and T2 run concurrently. Both access the managed memory read-only, but T1 reads from host code while T2 launches a GPU kernel on S2 (sketched in code below).
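For concreteness, here is a minimal sketch of the scenario (the kernels, the POSIX semaphore, and the launch configuration are just illustrative choices on my part; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <pthread.h>
#include <semaphore.h>

__global__ void populate_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = static_cast<float>(i);   // write phase, S1 only
}

__global__ void reader_kernel(const float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { volatile float v = data[i]; (void)v; }  // read-only access
}

// shared between T1 (main) and T2
const int N = 1 << 20;
float* g_data;
sem_t g_sem;
cudaStream_t g_s2;

void* t2_main(void*) {
    sem_wait(&g_sem);                 // T2 blocked until T1 signals
    // read-only access from the GPU side on S2
    reader_kernel<<<(N + 255) / 256, 256, 0, g_s2>>>(g_data, N);
    cudaStreamSynchronize(g_s2);
    return nullptr;
}

int main() {
    cudaStream_t s1;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&g_s2);
    sem_init(&g_sem, 0, 0);

    cudaMallocManaged(&g_data, N * sizeof(float));
    // single-stream attachment: allocation associated with S1 only
    cudaStreamAttachMemAsync(s1, g_data, 0, cudaMemAttachSingle);

    pthread_t t2;
    pthread_create(&t2, nullptr, t2_main, nullptr);

    populate_kernel<<<(N + 255) / 256, 256, 0, s1>>>(g_data, N);
    cudaStreamSynchronize(s1);        // writes are complete and visible

    sem_post(&g_sem);                 // unblock T2

    // T1 now reads from host code while T2's kernel reads on the GPU
    float sum = 0.f;
    for (int i = 0; i < N; ++i) sum += g_data[i];

    pthread_join(t2, nullptr);
    return 0;
}
```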
As I understand it, there is no cache-coherency or consistency hazard, because all write accesses are explicitly synchronized. However, two threads access the managed memory concurrently from the host and the GPU, which, AFAIK, the unified memory model does not allow even when the access is read-only.
This code runs on a Jetson AGX Xavier, so the question concerns only the Xavier-specific CUDA implementation. But since it is still a generic CUDA question, I decided to post it in the generic CUDA forum. Please move the thread if it is better suited for the Xavier forum.
Q1: Is this access pattern valid from the perspective of the unified memory model?
Q2: If it is not, how can such code be run efficiently without violating the CUDA memory-model constraints while still getting zero-copy access?