As far as I can tell, it goes through the L2 cache as an intermediary, but the data still originates from global memory.
Also, the L2 capacity is only 40 MB and the incoming data to L2 sums to only ~18.9 MB, so even assuming the cache was completely filled before execution, there are still ~75 MB whose origin isn't accounted for here.
L2 bytes hit + miss = 130.73 MB + 10.55 MB ≈ 141.28 MB
The NCU memory table reports this metric.
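As a sanity check, the hit rate implied by those two byte counts can be recomputed by hand (a small sketch; the numbers are the ones from the NCU memory table quoted above):

```python
# L2 read traffic reported by the NCU memory table (MB).
l2_hit_bytes = 130.73
l2_miss_bytes = 10.55

l2_total = l2_hit_bytes + l2_miss_bytes           # ~141.28 MB
l2_hit_rate = l2_hit_bytes / l2_total * 100.0     # ~92.5 %

print(f"L2 total: {l2_total:.2f} MB, hit rate: {l2_hit_rate:.1f} %")
```

That ~92.5 % figure should match the L2 hit rate NCU shows directly.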
Given the L2 hit rate and the DRAM bytes read, the following is likely true:
The total global memory footprint loaded into shared memory across all thread blocks is ≤ 10.55 MB, implying that multiple thread blocks load the same data into shared memory, which results in the high L2 hit rate.
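Under the assumption that the DRAM bytes read roughly equal the L2 miss bytes (10.55 MB), the average reuse per DRAM byte can be estimated with a back-of-the-envelope calculation (a sketch from the numbers above, not an NCU metric):

```python
# Total L2 read traffic vs. data actually fetched from DRAM (MB).
l2_read_bytes = 130.73 + 10.55   # hit + miss traffic seen at L2
dram_read_bytes = 10.55          # assumed ~= L2 miss bytes

reuse_factor = l2_read_bytes / dram_read_bytes   # ~13.4x
print(f"each DRAM byte is read ~{reuse_factor:.1f}x through L2")
```

A reuse factor around 13x is consistent with many thread blocks redundantly staging the same global data into shared memory.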