Does sharing a GPU with another process prevent me from fetching remote data?

I’m using NVSHMEM with two GPUs as a team. GPU 1 tries to fetch remote data from GPU 0 using nvshmem_getmem or nvshmem_int_g, but only gets 0, where it should get a non-zero value.
And someone else’s process is currently running on GPU 0.
But when I deploy these two processes to 2 unused GPUs, the program runs normally.
My code in a cuda kernel:

        nvshmem_getmem(d_recvbuf, d_sendbuf, sizeof(int), mapped_rank); // mapped_rank is 1 for process 0
        nvshmem_quiet();
        printf("rank_in_team=%d, idx=%d, recv value=%d\n", rank_in_team, idx, d_recvbuf[0]);
        nvshmem_quiet();
        int v = nvshmem_int_g(d_sendbuf, mapped_rank);
        printf("rank_in_team=%d, idx=%d, remote value=%d\n", rank_in_team, idx, v);

Both calls print 0 (recv value=0 and remote value=0), but neither value should be 0.
I’m thinking that sharing the GPU with others shouldn’t affect the correctness of nvshmem, but that doesn’t explain the behavior I’m experiencing.
Or am I not using the correct synchronization?

I’d appreciate any help!

WTF, I tested it again after I woke up and it worked (even sharing the GPU with others).
I didn’t modify any code, it’s amazing.

Is my understanding correct that you’re doing a “crisscross” where process 0 gets data from process 1, and process 1 gets data from process 0? Then you call quiet on both sides and print out what you got from the other side?

Where and how do the sendbuf and recvbuf get allocated? How do they get set to those non-zero values? Are you calling some kind of cross-PE synchronization (e.g. nvshmem_barrier_all()) after the values are set on each side, but before the crisscrossing gets are called?

Sorry for the late reply.
Thanks for your reply, I think I understand where my error lies. For both GPU0 and GPU1, I set the send/recv buffer contents using a CUDA kernel and executed cudaDeviceSynchronize.
However, I didn’t call cross-PE synchronization, which I suspect is the source of this strange behavior.
Thanks for the heads-up!

Yes, since the communication is one-sided, it is on the programmer to synchronize before the first communication is performed. cudaDeviceSynchronize only synchronizes one GPU (that is, one PE) with respect to itself, so it says nothing about whether the other PE has finished writing its buffer. An NVSHMEM barrier of some kind (e.g. nvshmem_barrier_all()) is needed.
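
For anyone landing here later, this is a minimal sketch of the ordering discussed in this thread, not the original poster's code. It assumes both buffers are allocated with nvshmem_malloc, that each PE runs on its own GPU, and that the program is started with two PEs. The kernel names init_kernel and get_kernel and the value 100 + mype are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical kernel: each PE writes a non-zero value into its own send buffer.
__global__ void init_kernel(int *d_sendbuf, int mype) {
    d_sendbuf[0] = 100 + mype;
}

// Hypothetical kernel: each PE reads the peer's send buffer with one-sided gets.
__global__ void get_kernel(int *d_recvbuf, const int *d_sendbuf, int mype, int peer) {
    // nvshmem_getmem is blocking: d_recvbuf is valid once it returns.
    nvshmem_getmem(d_recvbuf, d_sendbuf, sizeof(int), peer);
    printf("pe=%d, recv value=%d\n", mype, d_recvbuf[0]);

    int v = nvshmem_int_g(d_sendbuf, peer);
    printf("pe=%d, remote value=%d\n", mype, v);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Assumption: one PE per GPU on the node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric allocations: every PE must make the same calls in the same order.
    int *d_sendbuf = (int *)nvshmem_malloc(sizeof(int));
    int *d_recvbuf = (int *)nvshmem_malloc(sizeof(int));

    init_kernel<<<1, 1>>>(d_sendbuf, mype);
    cudaDeviceSynchronize();   // local only: this PE's write is complete

    nvshmem_barrier_all();     // cross-PE: every PE has finished its write

    int peer = (mype + 1) % npes;  // crisscross when npes == 2
    get_kernel<<<1, 1>>>(d_recvbuf, d_sendbuf, mype, peer);
    cudaDeviceSynchronize();

    nvshmem_free(d_sendbuf);
    nvshmem_free(d_recvbuf);
    nvshmem_finalize();
    return 0;
}
```

Without the nvshmem_barrier_all() call, a fast PE can issue its get before the peer's init_kernel has run, which would also explain why the result seemed to depend on how busy GPU 0 was. If the work is queued on a CUDA stream, nvshmemx_barrier_all_on_stream(stream) is the stream-ordered variant of the same barrier.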