Is Pinned memory non-atomic read/write safe on Xavier devices?

CUDA Version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX

I have been trying to find the answer to this across this forum, stack overflow, etc.

So far, what I have seen is that there is no need for an atomicRead in CUDA because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella

However, I have not found anything about cache flushing/loading for the iGPU on a Tegra device. Does the iGPU also serve 32-byte cachelines?

My use case is to have one kernel writing to various parts of a chunk of memory (not atomically, i.e. not using atomic* functions), while a second kernel only reads those same bytes in a non-tearing manner. I am okay with slightly stale data in my reads (provided the writing kernel flushes/updates the memory such that subsequent read kernels/processes see the update within a few milliseconds). The write kernel launches and completes within 4-8 ms or so.
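
To make the setup concrete, here is a minimal sketch of the scenario (the buffer size, kernel names, and use of two streams are my own illustration, not code from my project): one kernel performs plain, aligned 64-bit stores while a second kernel, potentially overlapping in time, performs plain 64-bit loads of the same words.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

__global__ void writerKernel(uint64_t *buf, int n, uint64_t stamp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = stamp;          // plain, non-atomic 64-bit store
}

__global__ void readerKernel(const uint64_t *buf, uint64_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = buf[i];         // plain, non-atomic 64-bit load (possibly stale)
}

int main()
{
    const int n = 1 << 20;
    uint64_t *buf = nullptr, *out = nullptr;
    cudaMalloc((void **)&buf, n * sizeof(uint64_t));   // device memory, just for illustration
    cudaMalloc((void **)&out, n * sizeof(uint64_t));

    cudaStream_t sWrite, sRead;
    cudaStreamCreate(&sWrite);
    cudaStreamCreate(&sRead);

    // Writer and reader may overlap in time; the questions are whether the
    // reader can observe a torn 64-bit value, and when the writes become
    // visible through the cache hierarchy.
    writerKernel<<<(n + 255) / 256, 256, 0, sWrite>>>(buf, n, 0xDEADBEEFCAFEF00DULL);
    readerKernel<<<(n + 255) / 256, 256, 0, sRead>>>(buf, out, n);

    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```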

At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?

Can/should pinned memory be used for this use case, or would unified memory be more appropriate so that I can take advantage of the cache safety within the iGPU?

According to the Memory Management section here, iGPU access to pinned memory is uncached. Does this mean we cannot trust the iGPU to still provide the safe access Robert described above?
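
For reference, here is a sketch of the two allocation paths I am deciding between (error checking omitted; sizes and variable names are illustrative): pinned, mapped host memory via cudaHostAlloc, which that table lists as uncached for the iGPU, versus managed memory via cudaMallocManaged.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Option A: pinned (page-locked) host memory, mapped into the device
    // address space; per the Tegra memory-management table, the iGPU
    // accesses this allocation uncached.
    uint64_t *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocMapped);

    // Option B: unified (managed) memory allocated with cudaMallocManaged.
    uint64_t *managed = nullptr;
    cudaMallocManaged((void **)&managed, bytes);

    // ... launch writer/reader kernels against either allocation ...

    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}
```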

If using pinned memory, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined-behavior/segfault territory?

Additionally, if using pinned memory and an atomic write and a read occur at the same time, what is the outcome?
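
To be explicit about what I mean by that combination, here is a sketch (names and values are illustrative): the writer uses a 64-bit atomicExch while the reader does an ordinary load of the same naturally aligned word.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long g_slot;       // naturally aligned 8-byte word

__global__ void atomicWriter(unsigned long long value)
{
    atomicExch(&g_slot, value);             // atomic 64-bit store
}

__global__ void plainReader(unsigned long long *out)
{
    *out = g_slot;                          // ordinary 64-bit load, no atomics
}

int main()
{
    unsigned long long *out = nullptr;
    cudaMalloc((void **)&out, sizeof(unsigned long long));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The two kernels may run concurrently on different streams.
    atomicWriter<<<1, 1, 0, s1>>>(0x0123456789ABCDEFULL);
    plainReader<<<1, 1, 0, s2>>>(out);
    cudaDeviceSynchronize();

    unsigned long long h = 0;
    cudaMemcpy(&h, out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("reader saw 0x%llx\n", h);

    cudaFree(out);
    return 0;
}
```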

My goal is to remove the CPU-side mutexing around the memory used by my various kernels, since it is causing a coupling/slow-down between two parts of my system.

Any advice is much appreciated. TIA.


@Robert_Crovella or @pshin, would love to know the answers to some of the questions in this post!


Since I have not gotten an answer, I have posted here as well.

This is not an answer to the question, but nevertheless I’ll make a few comments, to delineate things as I see it.

The linked SO question in my view pertains to:

  • CUDA discrete GPUs
  • considering activity when all accessors are CUDA threads
  • the possibility of “tearing”, not questions related to memory ordering.

One of the answers there suggests there is “more to atomics” than tearing, and then describes a bunch of material related to ordering (in my view). It’s all good material. If your question pertains to ordering, pay attention to that. In my view, the original SO question there is not asking about ordering, but the OP then stated that they wanted something like a C++ atomic load, which has ordering specifiers. It’s still not entirely clear whether that was the important part of their request, but it might have been.

If we leave ordering aside, and subject to the bullets I have already listed, I stand by my comments. I’m quite convinced that two CUDA threads, accessing the same address in a properly aligned fashion, both accessing the same type there, up to 64-bits in size, cannot get a torn value. No atomics, volatile, __threadfence() or other memory gymnastics should be required to achieve that. My statement says nothing about ordering, stale values, or anything like that.
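
As a minimal sketch of what I mean (the kernel and values here are just for illustration): two threads in one kernel touch the same properly aligned 64-bit word, one storing and one loading, with no atomics or fences. The load may observe either the old or the new value, but not a blend of the two.

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tornOrNot(uint64_t *word, uint64_t *observed)
{
    if (threadIdx.x == 0)
        *word = 0xAAAAAAAAAAAAAAAAULL;   // plain 64-bit store
    else if (threadIdx.x == 1)
        *observed = *word;               // plain 64-bit load, no atomics
}

int main()
{
    uint64_t *word = nullptr, *observed = nullptr;
    cudaMalloc((void **)&word, sizeof(uint64_t));
    cudaMalloc((void **)&observed, sizeof(uint64_t));
    cudaMemset(word, 0x55, sizeof(uint64_t));    // old value: 0x5555555555555555

    tornOrNot<<<1, 2>>>(word, observed);
    cudaDeviceSynchronize();

    uint64_t h = 0;
    cudaMemcpy(&h, observed, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    // Expected: either 0x5555555555555555 or 0xAAAAAAAAAAAAAAAA, never a mix.
    printf("observed 0x%llx\n", (unsigned long long)h);

    cudaFree(word);
    cudaFree(observed);
    return 0;
}
```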

This question, in my view, is asking about the interaction between the CPU thread and one or more CUDA threads on Tegra. I don’t know about that and have therefore chosen not to provide any answers here.

Understood. Thanks for the response, @Robert_Crovella. Luckily, most of what we are planning to do will happen on the GPU, so this answer mostly works for us.

I am still curious about CPU-side access for reads/writes, for future-proofing, if anyone else wants to take a stab at it.
