Are vload and vstore atomic?

Keldor314 · January 7, 2010, 4:05am

Are the vector load and store instructions atomic? I know they could be implemented that way by using shared memory to stage each component to a different thread, which would result in a single coalesced access, but I don’t know whether NVIDIA’s implementation works like this. Can anyone shed some light on this? It would be very useful for cross-platform code if they were, since otherwise I have to write the atomic coalesced transaction by hand and maintain different code paths for different architectures…

jcpalmer · January 7, 2010, 8:29pm

I cannot answer your question, but in case you do not get a favorable answer, images come fairly close. You can only get 4 values at a time, but there are other benefits such as caching and, most likely, seamless portability. NVIDIA’s implementation has high latency, so large work units are required to hide it.

Added: It would be nice if there were 1D textures with high size limits, but it seems the designers of OpenCL want to push the use of global memory.

Keldor314 · January 10, 2010, 1:18am

Unfortunately, using images is not an option, since I need full read/write access to the buffer in question. The buffer is actually a pool of points, which are selected at random, read in, modified, and replaced. Collisions at the level of multiple points selecting the same point are acceptable as long as the read and the write are each atomic, but partially updated points (e.g. from non-atomic reads or writes colliding) cause artifacts.

Actually, it’s a bit more complicated than that, since there are several point buffers. Each iteration, we read in a new point, write the old point back to that location, then pick a function at random from a list of possible functions available to points in the current pool. This function applies a transformation to the point, and selects the point pool for the point to be stored to in the next iteration. Still, the basic requirements for read atomicity and write atomicity are the same.
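For readers trying to follow the loop described above, here is a minimal sequential sketch of one iteration, written as host-side C++ rather than as an OpenCL kernel. Every name in it (`Point`, `PointFunction`, `Walker`, `iterate`) is hypothetical, and the two-float point layout is an assumption; the thread does not give the real data structures. The `std::swap` marks the spot where, on the GPU, the read and the write would each need to be atomic.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical point layout; the thread does not say how many components a point has.
struct Point {
    float x;
    float y;
};

// A function transforms a point and names the pool it is stored to next iteration.
struct PointFunction {
    Point (*transform)(Point);
    std::size_t next_pool;
};

// The point a thread currently holds, plus the pool it will be written back to.
struct Walker {
    Point current;
    std::size_t pool;
};

// One iteration: swap the held point with a random slot of the current pool,
// then apply a randomly chosen function available to that pool.
// Assumes every pool and every function list is non-empty.
void iterate(Walker& w,
             std::vector<std::vector<Point>>& pools,
             const std::vector<std::vector<PointFunction>>& functions,
             std::mt19937& rng) {
    std::vector<Point>& pool = pools[w.pool];
    std::uniform_int_distribution<std::size_t> pick_slot(0, pool.size() - 1);
    const std::size_t slot = pick_slot(rng);

    // Read the new point and write the old one back to the same location.
    // On the device this is the step that must not tear.
    std::swap(w.current, pool[slot]);

    const std::vector<PointFunction>& available = functions[w.pool];
    std::uniform_int_distribution<std::size_t> pick_fn(0, available.size() - 1);
    const PointFunction& f = available[pick_fn(rng)];

    w.current = f.transform(w.current);  // transform the freshly read point
    w.pool = f.next_pool;                // pool to store to in the next iteration
}
```

With a single slot per pool and a single function per pool the random choices are forced, which makes the sketch easy to step through by hand.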

The ideal would be a full atomic exchange of the entire point. However, there is no documentation on how atomics behave with respect to memory coalescing. Some tests with atomics indicate that they follow the same coalescing rules as any other memory transaction, but can we depend on this? Some official word on this would be helpful.
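To make "atomic exchange for the entire point" concrete, here is a hedged host-side C++ sketch. It assumes a hypothetical point of two 32-bit floats (`Point2`), which happens to fit in one 64-bit word; the thread does not state the actual point size, and a larger point would not fit this scheme. The helper names are illustrative only. On the device side the rough analogue would be a 64-bit exchange such as `atom_xchg` from the `cl_khr_int64_base_atomics` extension, or CUDA's `atomicExch` on `unsigned long long`; whether those accesses coalesce is exactly the open question in this thread, and this sketch does not answer it.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical two-component point; the real layout is not given in the thread.
struct Point2 {
    float x;
    float y;
};

static_assert(sizeof(Point2) == sizeof(std::uint64_t),
              "Point2 must fit exactly in one 64-bit word");

// Pack both components into one 64-bit word so they always travel together.
inline std::uint64_t pack_point(Point2 p) {
    std::uint64_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    return bits;
}

// Recover the two components from a packed word.
inline Point2 unpack_point(std::uint64_t bits) {
    Point2 p;
    std::memcpy(&p, &bits, sizeof p);
    return p;
}

// Replace the point in the slot in one indivisible step and return the
// previous occupant. No reader can observe x from one point and y from another.
inline Point2 exchange_point(std::atomic<std::uint64_t>& slot, Point2 incoming) {
    return unpack_point(
        slot.exchange(pack_point(incoming), std::memory_order_relaxed));
}
```

Relaxed ordering is used because the requirement stated above is only that each point is read and written whole, not that updates to different points are ordered relative to each other.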
