I have a few questions related to memory access inside a streaming multiprocessor (SM).
As far as I understand:
Load and store instructions are handled by the SM's LD/ST units, which operate on L1 cache data, and each LD/ST unit handles the memory operation of a single thread.
LD/ST units work on chunks of data, e.g. 32 bytes per transaction.
Is this correct? If so, when the threads of a warp read from or write to neighboring locations (coalesced access), do all of these reads/writes get assigned to a single LD/ST unit, given that they all fall within one chunk's range of addresses?
In other words, do we care about coalesced memory access in order to make better use of the LD/ST units, or in order to avoid the cache miss penalty?
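
To make concrete what I mean by "within the chunk range", here is a small host-side sketch in plain Python (not GPU code). It only counts how many distinct 32-byte chunks one warp-wide access would cover. The 32-byte chunk size is the figure from my understanding above; the warp size of 32, the 4-byte element size, and the name `segments_touched` are my own assumptions for illustration. It says nothing about how the hardware actually assigns accesses to LD/ST units, which is exactly what I am asking about.

```python
WARP_SIZE = 32       # threads per warp (assumed)
SEGMENT_BYTES = 32   # chunk size per transaction, as assumed in the question
ELEM_BYTES = 4       # each thread accesses one 4-byte element (assumed)


def segments_touched(stride_elems, base_addr=0):
    """Count the distinct 32-byte chunks covered by one warp-wide access.

    Thread `lane` accesses ELEM_BYTES bytes starting at
    base_addr + lane * stride_elems * ELEM_BYTES.
    """
    segments = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_elems * ELEM_BYTES
        last = first + ELEM_BYTES - 1
        # An element may straddle a chunk boundary, so record every chunk it covers.
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    return len(segments)


coalesced = segments_touched(1)  # neighboring elements: 128 contiguous bytes
strided = segments_touched(8)    # each thread 32 bytes apart

print("coalesced:", coalesced, "chunks")
print("strided:  ", strided, "chunks")
```

Under these assumptions the neighboring pattern covers 4 chunks and the strided one covers 32, so my question is whether that difference matters because of how the LD/ST units are used, or because of cache misses.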