I’m currently writing a term paper on the GK110 architecture, and I’m wondering how L1 cache requests behave:
First possibility: the L1 cache request is broken down into cache lines, and every needed cache line is transferred in full to the load/store units. Thus, if the threads of a warp don’t access the L1 cache in a coalesced manner, much of the L1 bandwidth is wasted.
Second possibility: this waste could be avoided if the L1 cache didn’t always transfer a complete cache line, but only those bytes of a line that are actually needed. Since the L1 cache and shared memory share the same hardware, at least part of the L1 cache hardware should be capable of this.
Which of the two is correct?
I’d suppose the first possibility is what actually happens, but I’m not sure.
Thanks in advance for your help! :)