Is Pinned memory non-atomic read/write safe on Xavier devices?

CUDA Version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX

I have been trying to find the answer to this across this forum, stack overflow, etc.

So far, what I have seen is that there is no need for an atomicRead in CUDA because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella

However, I have not found anything about cache flushing/loading for the iGPU on a Tegra device. Does the iGPU also serve 32-byte cachelines?

My use case is to have one kernel writing to various parts of a chunk of memory (not atomically, i.e. not using atomic* functions), while a second kernel only reads those same bytes in a non-tearing manner. I am okay with slightly stale data in my reads (provided the writing kernel flushes/updates the memory such that subsequent read kernels/processes see the update within a few milliseconds). The write kernel launches and completes within 4-8 ms or so.
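
To make the setup concrete, here is a minimal sketch of the scenario (the buffer size, kernel names, and use of two streams are my own illustration, not code from my project): one kernel performs plain, aligned 64-bit stores while a second kernel, potentially overlapping in time, performs plain 64-bit loads of the same words.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

__global__ void writerKernel(uint64_t *buf, int n, uint64_t stamp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = stamp;          // plain, non-atomic 64-bit store
}

__global__ void readerKernel(const uint64_t *buf, uint64_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = buf[i];         // plain, non-atomic 64-bit load (possibly stale)
}

int main()
{
    const int n = 1 << 20;
    uint64_t *buf = nullptr, *out = nullptr;
    cudaMalloc((void **)&buf, n * sizeof(uint64_t));   // device memory, just for illustration
    cudaMalloc((void **)&out, n * sizeof(uint64_t));

    cudaStream_t sWrite, sRead;
    cudaStreamCreate(&sWrite);
    cudaStreamCreate(&sRead);

    // Writer and reader may overlap in time; the questions are whether the
    // reader can observe a torn 64-bit value, and when the writes become
    // visible through the cache hierarchy.
    writerKernel<<<(n + 255) / 256, 256, 0, sWrite>>>(buf, n, 0xDEADBEEFCAFEF00DULL);
    readerKernel<<<(n + 255) / 256, 256, 0, sRead>>>(buf, out, n);

    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```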

At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?

Can/should pinned memory be used for this use case, or would unified memory be more appropriate so that I can take advantage of the cache safety within the iGPU?

According to the Memory Management section here, iGPU access to pinned memory is uncached. Does this mean we cannot trust the iGPU to still provide the safe access Robert described above?
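
For reference, here is a sketch of the two allocation paths I am deciding between (error checking omitted; sizes and variable names are illustrative): pinned, mapped host memory via cudaHostAlloc, which that table lists as uncached for the iGPU, versus managed memory via cudaMallocManaged.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Option A: pinned (page-locked) host memory, mapped into the device
    // address space; per the Tegra memory-management table, the iGPU
    // accesses this allocation uncached.
    uint64_t *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocMapped);

    // Option B: unified (managed) memory allocated with cudaMallocManaged.
    uint64_t *managed = nullptr;
    cudaMallocManaged((void **)&managed, bytes);

    // ... launch writer/reader kernels against either allocation ...

    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}
```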

If using pinned memory, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined-behavior/segfault territory?

Additionally, if using pinned memory and an atomic write and a read occur at the same time, what is the outcome?
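
To be explicit about what I mean by that combination, here is a sketch (names and values are illustrative): the writer uses a 64-bit atomicExch while the reader does an ordinary load of the same naturally aligned word.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long g_slot;       // naturally aligned 8-byte word

__global__ void atomicWriter(unsigned long long value)
{
    atomicExch(&g_slot, value);             // atomic 64-bit store
}

__global__ void plainReader(unsigned long long *out)
{
    *out = g_slot;                          // ordinary 64-bit load, no atomics
}

int main()
{
    unsigned long long *out = nullptr;
    cudaMalloc((void **)&out, sizeof(unsigned long long));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The two kernels may run concurrently on different streams.
    atomicWriter<<<1, 1, 0, s1>>>(0x0123456789ABCDEFULL);
    plainReader<<<1, 1, 0, s2>>>(out);
    cudaDeviceSynchronize();

    unsigned long long h = 0;
    cudaMemcpy(&h, out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("reader saw 0x%llx\n", h);

    cudaFree(out);
    return 0;
}
```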

My goal is to remove the CPU-side mutexing around the memory used by my various kernels, since it is causing a coupling/slow-down between two parts of my system.

Any advice is much appreciated. TIA.


@Robert_Crovella or @pshin, would love to know the answers to some of the questions in this post!


Since I have not gotten an answer, I have posted here as well.

This is not an answer to the question, but nevertheless I’ll make a few comments, to delineate things as I see it.

The linked SO question in my view pertains to:

  • CUDA discrete GPUs
  • considering activity when all accessors are CUDA threads
  • the possibility of “tearing”, not questions related to memory ordering.

One of the answers there suggests there is “more to atomics” than tearing, and then describes a bunch of material related to ordering (in my view). It’s all good material. If your question pertains to ordering, pay attention to that. In my view, the original SO question there is not asking about ordering, but the OP then stated that they wanted something like a C++ atomic load, which has ordering specifiers. It’s still not entirely clear whether that was the important part of their request, but it might have been.

If we leave ordering aside, and subject to the bullets I have already listed, I stand by my comments. I’m quite convinced that two CUDA threads, accessing the same address in a properly aligned fashion, both accessing the same type there, up to 64-bits in size, cannot get a torn value. No atomics, volatile, __threadfence() or other memory gymnastics should be required to achieve that. My statement says nothing about ordering, stale values, or anything like that.
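
As a minimal sketch of what I mean (the kernel and values here are just for illustration): two threads in one kernel touch the same properly aligned 64-bit word, one storing and one loading, with no atomics or fences. The load may observe either the old or the new value, but not a blend of the two.

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tornOrNot(uint64_t *word, uint64_t *observed)
{
    if (threadIdx.x == 0)
        *word = 0xAAAAAAAAAAAAAAAAULL;   // plain 64-bit store
    else if (threadIdx.x == 1)
        *observed = *word;               // plain 64-bit load, no atomics
}

int main()
{
    uint64_t *word = nullptr, *observed = nullptr;
    cudaMalloc((void **)&word, sizeof(uint64_t));
    cudaMalloc((void **)&observed, sizeof(uint64_t));
    cudaMemset(word, 0x55, sizeof(uint64_t));    // old value: 0x5555555555555555

    tornOrNot<<<1, 2>>>(word, observed);
    cudaDeviceSynchronize();

    uint64_t h = 0;
    cudaMemcpy(&h, observed, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    // Expected: either 0x5555555555555555 or 0xAAAAAAAAAAAAAAAA, never a mix.
    printf("observed 0x%llx\n", (unsigned long long)h);

    cudaFree(word);
    cudaFree(observed);
    return 0;
}
```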

This question, in my view, is asking about the interaction between the CPU thread and one or more CUDA threads on Tegra. I don’t know about that and have therefore chosen not to provide any answers here.

Understood. Thanks for the response, @Robert_Crovella. Luckily, most of what we are planning to do will happen on the GPU, so this answer mostly works for us.

I am still curious about CPU-side access for reads/writes, for future-proofing, if anyone else wants to take a stab at it.
