I know that in CUDA programming, memory operations at different levels can overlap. For example, transfers from global memory to shared memory can overlap with transfers from shared memory to registers. But can read and write operations at the same memory level overlap, for example reads and writes to shared memory?
Thank you very much!
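(For context, the overlap described in the question is the classic double-buffering pattern. The sketch below is illustrative only; the kernel name, tile size, and reduction are made up for this example, not taken from any particular codebase. The global load for the next tile is issued early so its latency overlaps with the shared-memory reads for the current tile.)

```cuda
// Hedged sketch: double-buffered tiling with a hypothetical 256-thread block.
// Each iteration issues the global->register load for tile i+1 before
// consuming tile i from shared memory, so the two overlap in time.
__global__ void tiled_sum(const float *in, float *out, int ntiles)
{
    __shared__ float tile[2][256];
    int t = threadIdx.x;
    float acc = 0.0f;

    tile[0][t] = in[t];            // preload tile 0
    __syncthreads();

    for (int i = 0; i < ntiles; ++i) {
        int cur = i & 1;
        // Issue the global load for the next tile now; its long latency
        // overlaps with the shared->register reads below.
        float next = (i + 1 < ntiles) ? in[(i + 1) * 256 + t] : 0.0f;

        acc += tile[cur][t];       // shared -> register read

        __syncthreads();
        if (i + 1 < ntiles)
            tile[1 - cur][t] = next;   // register -> shared store
        __syncthreads();
    }
    out[t] = acc;
}
```

On Ampere and newer GPUs the same idea can be expressed more directly with asynchronous copies (`memcpy_async` / `cuda::pipeline`), which move data from global to shared memory without staging through registers.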
All operations on a GPU have latency. That is, it takes multiple clock cycles for any transaction to complete, although new requests can typically be issued every clock cycle or every other clock cycle.
In that context, nearly anything can overlap.
If you mean "can a shared load and a shared store be issued to shared memory in the same clock cycle on the same SM?", that is not well specified by NVIDIA, and comes down to shared-memory bandwidth. It is not much different from asking whether two reads can be issued in the same cycle. You would need to study microbenchmarking papers to see what the results are likely to be. However, my mental model is that shared memory typically has enough bandwidth to accept requests at full rate (i.e., one request per clock) from a single sub-partition in a single SM. That may not be accurate in all cases. I do not assume, for example, that shared memory can accept requests simultaneously from all 4 sub-partitions in a modern GPU SM. My expectation is that, in some unspecified way, those requests would be serialized to some degree.
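The kind of measurement those microbenchmarking papers make can be sketched roughly as below. This is an assumption-laden illustration, not a validated benchmark: the kernel name, buffer size, and unroll count are arbitrary, and a real study would control for pipeline effects, bank conflicts, and compiler scheduling far more carefully.

```cuda
// Hedged sketch of a shared-memory load/store timing probe. It times an
// unrolled sequence of dependent shared load/store pairs with clock64();
// comparing the per-pair cost against a loads-only variant of the loop
// hints at whether loads and stores contend for shared-memory bandwidth.
__global__ void smem_ldst_probe(long long *cycles, float *sink)
{
    __shared__ float buf[1024];
    int t = threadIdx.x;
    buf[t] = (float)t;
    __syncthreads();

    float v = buf[t];
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i) {
        buf[(t + i) & 1023] = v;        // shared store
        v = buf[(t + i + 1) & 1023];    // dependent shared load
    }
    long long stop = clock64();

    if (t == 0) *cycles = stop - start;
    sink[t] = v;   // keep the compiler from optimizing the loop away
}
```

Results from a probe like this vary by architecture, which is exactly why the behavior is best treated as unspecified rather than assumed.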