C2050 memory model

Hi

Is there any document which describes the memory model of the Tesla C2050 in more detail?

Especially with the L1 and, more importantly, the L2 cache, it would be great if more detail were available. Specifically, I had the following questions:

  • The NVIDIA profiler gives numbers only for the L1 cache hits/misses… is there any tool to get these numbers for the L2 cache?

  • The L2 cache is shared across all the blocks; what is its impact, specifically for data being loaded repeatedly within a block?

  • How does this cache model relax the coalescing requirement? (It’s mentioned that the coalescing requirements are not as stringent, but what would be the preferred access pattern, if any? See the sketch after this list.)
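To make the question concrete, this is the kind of access pattern I mean (a minimal sketch; the kernel name and the offset parameter are my own, not from any NVIDIA document):

[code]
// Straight copy where each thread's index is shifted by `offset` elements.
// With offset == 0, a warp's 32 four-byte accesses fall in one 128-byte
// cache line; with offset != 0 they straddle two lines. Timing this for
// offset = 0..32 shows how much the caches soften misalignment.
__global__ void offsetCopy(float *out, const float *in, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}
[/code]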

Is there any tool available that provides info on the data access patterns of the GPU?

Thanks.

I’ve seen an interesting presentation by David Kirk: http://impact.crhc.illinois.edu/sslecture/lecture7_2010_cuda_fermi_overview_DK.pdf. It mentions the memory hierarchy too. In particular, he cites 230 GB/s of L2 cache bandwidth, roughly 2x the peak global memory bandwidth.

Isn’t the coalescing model described in the programming guide?


Thanks, this is one of the very few presentations with more detailed info on the Fermi cards.

I am not referring to the traditional coalescing model. The programming guide describes the coalescing model and goes on to say that with Fermi the coalescing requirements are far less stringent. I was curious about the more specific implications of the L1/L2 caches for memory access patterns. Any info on that would be helpful.

Thanks
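For what it’s worth, the host API does let you query the L2 size and hint the L1/shared split on Fermi. A minimal sketch (error checking omitted; assumes a CUDA runtime recent enough to expose l2CacheSize):

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *p) { p[threadIdx.x] += 1.0f; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Reported in bytes; a C2050 should show 786432 (768 KB).
    printf("L2 cache: %d bytes\n", prop.l2CacheSize);

    // Fermi splits 64 KB per SM between L1 and shared memory;
    // this requests the 48 KB L1 / 16 KB shared configuration.
    cudaFuncSetCacheConfig(touch, cudaFuncCachePreferL1);
    return 0;
}
[/code]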


In Section G.4.2 they are quite specific about how the L1/L2 caches are involved:

“Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.”
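To make the guide’s wording concrete, here is the arithmetic for one warp (my own illustration, not from the guide): 32 threads reading consecutive 4-byte floats issue one 128-byte line request when aligned, and two when the base address is shifted.

[code]
#include <cstdio>

// Sketch: number of 128-byte cache lines touched by a warp's load of
// 32 consecutive 4-byte elements starting at `byteOffset`.
int linesTouched(int byteOffset)
{
    int first = byteOffset / 128;                 // line of the first thread
    int last  = (byteOffset + 32 * 4 - 1) / 128;  // line of the last thread
    return last - first + 1;
}

int main()
{
    printf("aligned:    %d line(s)\n", linesTouched(0)); // 1
    printf("offset 4 B: %d line(s)\n", linesTouched(4)); // 2
    return 0;
}
[/code]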


In this presentation, on slide 15, regarding the cache on Fermi, he says:

[i]“Not designed for CPU-style reuse, so don’t worry about blocking.

Designed to improve perf for misaligned access, small strides, some register spilling”[/i]

So this is not entirely clear. How is it different from conventional caches? Any more details on this?

Thanks
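One way to probe the difference experimentally: on Fermi, nvcc lets you compile global loads either cached in L1+L2 (the default) or in L2 only, so you can measure what L1 buys a given access pattern. A sketch (the kernel is mine; the -dlcm flags are real ptxas options):

[code]
// Strided read to compare L1-cached vs L2-only global loads on Fermi.
// Build twice and time each version:
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca stride.cu   // default: cache in L1 and L2
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg stride.cu   // bypass L1, cache in L2 only
__global__ void strideRead(float *out, const float *in, int stride, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = tid * stride;
    if (i < n)
        out[tid] = in[i];
}
[/code]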
