Are the vector load and store instructions atomic? I know that they could be implemented as such by using shared memory to stage each component to a different thread, thus resulting in a single coalesced access, but I don’t know whether Nvidia’s implementation works like this. Can anyone shed some light on this? It would be very useful for cross-platform code if they were, since otherwise I have to write the atomic coalesced transaction by hand and maintain different code paths for different architectures.
I cannot answer your question, but if you do not get a favorable answer, images come fairly close. You can only read 4 values at a time, but there are other benefits such as caching and, most likely, seamless portability. NVIDIA’s implementation has high latency, so large work units are required to hide it.
added: It would be nice if there were 1D textures with high limits, but it seems the designers of OpenCL want to push using global memory.
Unfortunately, using images is not an option, since I need full read/write access to the buffer in question. The buffer is actually a pool of points, which are selected at random, read in, modified, and replaced. Collisions in which multiple points select the same pool entry are acceptable as long as the read and the write are each atomic, but partially updated points - e.g. from non-atomic reads or writes colliding - cause artifacts.
Actually, it’s a bit more complicated than that, since there are several point buffers. Each iteration, we read in a new point, write the old point back to that location, then pick a function at random from a list of possible functions available to points in the current pool. This function applies a transformation to the point, and selects the point pool for the point to be stored to in the next iteration. Still, the basic requirements for read atomicity and write atomicity are the same.
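To make the per-iteration step concrete, here is a minimal single-threaded sketch in host-side C++ (not kernel code). Everything named here is hypothetical: the two-float `Point` layout, the `Transform`, `Pool` and `Walker` types, and the `step` function are assumptions for illustration only. It shows just the order of operations described above and says nothing about atomicity.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical point layout; the real one may have more components.
struct Point { float x, y; };

// Hypothetical transform: maps a point and names the pool that the
// result will be stored to on the next iteration.
struct Transform {
    Point (*apply)(Point);
    std::size_t next_pool;
};

// A pool holds its points plus the functions available to points in it.
struct Pool {
    std::vector<Point> points;
    std::vector<Transform> transforms;
};

// Per-thread state: the point currently held and the pool it goes to next.
struct Walker {
    Point held;
    std::size_t pool;
};

// One iteration: swap the held point with a random entry of the current
// pool, then transform the point just read and record its next pool.
template <class Rng>
void step(std::vector<Pool>& pools, Walker& w, Rng& rng) {
    Pool& pool = pools[w.pool];
    assert(!pool.points.empty() && !pool.transforms.empty());

    std::uniform_int_distribution<std::size_t> pick_slot(0, pool.points.size() - 1);
    const std::size_t slot = pick_slot(rng);

    // Read the new point and write the old one back to the same location.
    // In the parallel version this swap is the part that must not tear.
    std::swap(w.held, pool.points[slot]);

    // Pick one of the functions available in the current pool at random.
    std::uniform_int_distribution<std::size_t> pick_fn(0, pool.transforms.size() - 1);
    const Transform& t = pool.transforms[pick_fn(rng)];

    w.held = t.apply(w.held);
    w.pool = t.next_pool;
}
```

The swap in the middle is the only place where another thread could observe a half-written point, which is why the atomicity of the read and the write matters.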
The ideal would be a full atomic exchange of the entire point. However, there is no documentation on how atomics behave with respect to memory coalescing. Some tests with atomics indicate that they follow the same coalescing rules as any other memory transaction, but can we depend on this? Some official word on this would be helpful.
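For the special case where a point fits in 64 bits, a whole-point exchange can be sketched by packing the components into one word and exchanging that word. The host-side C++ below is only an analogy using `std::atomic`. The two-float `Point2` layout and the helper names `pack`, `unpack` and `exchange_point` are assumptions, not anything from a vendor API. In a kernel, the closest counterpart would presumably be the 64-bit `atom_xchg` from the `cl_khr_int64_base_atomics` extension, where the device supports it. This does not answer the coalescing question, and it does not extend to points wider than 64 bits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 2-component point; assumed to be exactly 64 bits wide.
struct Point2 { float x; float y; };
static_assert(sizeof(Point2) == sizeof(std::uint64_t),
              "this sketch only works if the point fits one 64-bit word");

// Reinterpret the point's bytes as a single 64-bit word.
inline std::uint64_t pack(Point2 p) {
    std::uint64_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    return bits;
}

// Inverse of pack().
inline Point2 unpack(std::uint64_t bits) {
    Point2 p;
    std::memcpy(&p, &bits, sizeof p);
    return p;
}

// Store `incoming` into the slot and return the point that was there,
// as one indivisible operation: no reader can see a mix of old and new.
inline Point2 exchange_point(std::atomic<std::uint64_t>& slot, Point2 incoming) {
    // A lock-based fallback would defeat the purpose of the sketch.
    assert(slot.is_lock_free());
    return unpack(slot.exchange(pack(incoming), std::memory_order_relaxed));
}
```

Usage would look like `Point2 old = exchange_point(pool[i], mine);`, where `pool` is a hypothetical array of `std::atomic<std::uint64_t>` slots.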