on load/ store units

hello,

a bit more information on the load/ store units, if i may

if the load/store units handle global memory access, and if global memory reads move through the L2 and L1 caches, do the load/store units alone take care of global memory reads, or does the rest of the sm have some role to play too?

this is really what i am trying to grasp:

if the load/ store units manage global memory access, how many (warp-sized) global memory reads would a load/ store unit be able to handle/ mind at a time?

given that global memory reads have latency, i am really assuming/hoping the load/store units have some pipeline or buffer, so that they can mind (issue and wait on) multiple warp-sized global memory reads at a time…?

if the sm has 32 load/store units, and if the sm can hold up to 64 resident warps, that already amounts to a theoretical 64 warp-sized global reads in flight, or 2 warp-sized global reads per load/store unit. this assumes the sm runs 2 kernels, each with 32 warps, to reach the maximum number of warps, and equally assumes all warps issue global memory reads at more or less the same time

if a kernel contains multiple global reads, with the global reads not really dependent on prior reads or such, such that the successive global reads can (theoretically) be issued without delay, then the theoretical warp-sized global-reads per load/store unit figure may even be higher

so how many such global reads can be simultaneously issued by warps before the load/ store units would be unable to accept any more?

assume memory access to be coalesced

i suppose i could reference the ptx document for clues; but any hints would be greatly appreciated

the math above is of course wrong; 64 warps issuing global reads at the same time should result in 64 warp-issued global reads per one warp-wide block of load/store units, and not 32

when i look at ptx instructions like mov and ld, i get the impression that a load/store unit can only manage one load at a time, and that x global reads issued from y warps on the same sm would take something like:

(x * issue-time) + (x * latency); where issue time is the average time for the warp to issue the request on the sm

instead of: (x * issue-time) + (x/factor * latency), for example
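
to make the difference between the two formulas above concrete, here is a rough python sketch of both. the issue time, latency and factor values are made-up placeholders for illustration, not figures for any real device:

```python
# all numbers below are hypothetical placeholders, not measured hardware values

def serialized_cycles(x, issue_cycles, latency_cycles):
    """model 1: every read pays the full latency, one after the other."""
    return x * issue_cycles + x * latency_cycles


def overlapped_cycles(x, issue_cycles, latency_cycles, factor):
    """model 2: up to `factor` reads overlap, so latency is paid x/factor times."""
    return x * issue_cycles + (x / factor) * latency_cycles


x = 64          # warp-sized global reads issued on one sm
issue = 4       # assumed cycles to issue one warp-sized read
latency = 400   # assumed global memory latency in cycles

print(serialized_cycles(x, issue, latency))             # 25856
print(overlapped_cycles(x, issue, latency, factor=64))  # 656.0
```

with these assumed values the first model is dominated almost entirely by latency, while in the second the latency is paid only once; which model (and which factor) actually applies is exactly the open question here.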

but if i take the value of 1 warp load per load/store unit at a time, the number of sms, and the value of the latency, i do not get anywhere near the theoretical global memory bandwidth…? (implying that one would hardly achieve global memory bandwidth with predominantly global reads…?)
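
the back-of-envelope calculation behind that last point can be sketched as below, using little's law (throughput = data in flight / latency). the sm count, latency and peak bandwidth are assumed round numbers, not the specifications of any particular gpu; the 128 bytes per load assumes a fully coalesced read of 4 bytes per thread:

```python
# hypothetical figures for illustration only; substitute values for a real device

def bandwidth_bytes_per_s(num_sms, loads_in_flight_per_sm, bytes_per_load, latency_s):
    """little's law: throughput = data in flight / latency."""
    return num_sms * loads_in_flight_per_sm * bytes_per_load / latency_s


def loads_needed_per_sm(target_bytes_per_s, num_sms, bytes_per_load, latency_s):
    """warp-sized loads each sm must keep in flight to sustain the target bandwidth."""
    return target_bytes_per_s * latency_s / (num_sms * bytes_per_load)


NUM_SMS = 15            # assumed sm count
BYTES_PER_LOAD = 128    # 32 threads x 4 bytes, fully coalesced
LATENCY_S = 400e-9      # assumed 400 ns global memory latency
PEAK = 200e9            # assumed 200 GB/s theoretical bandwidth

one_at_a_time = bandwidth_bytes_per_s(NUM_SMS, 1, BYTES_PER_LOAD, LATENCY_S)
needed = loads_needed_per_sm(PEAK, NUM_SMS, BYTES_PER_LOAD, LATENCY_S)

print(one_at_a_time / 1e9)  # GB/s with a single warp load in flight per sm
print(needed)               # warp loads per sm needed to reach PEAK
```

under these assumptions a single warp load in flight per sm gives roughly 4.8 GB/s, and it would take about 42 warp loads in flight per sm to reach the assumed peak, which is the gap described above.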

does the compiler ever use the prefetch or prefetchu instructions?