On slide 33 of this presentation - http://www.wasp.uwa.edu.au/home/news?f=282223 - Mark Harris claims that if you use a compiler switch to turn Fermi’s L1 cache off, you can get 32B segment transactions. Could anyone check that it works? I’d be very interested to see a microbenchmark… So far I can see only 128B transactions :(
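Roughly the kind of microbenchmark I mean, as a sketch (I'm assuming the switch in question is nvcc's -Xptxas -dlcm=cg, which bypasses L1 for global loads on Fermi; -dlcm=ca is the default caching behaviour):

// Rough scattered-read microbenchmark. Build it twice and compare timings:
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca scatter.cu   (default: 128B L1 line fills)
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg scatter.cu   (L1 bypassed: 32B L2 segments)
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one float that sits 128 bytes away from its neighbour's,
// so every load in a warp touches a different 128B line.
__global__ void scatterRead(const float *in, float *out, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;              // scattered index
    out[i] = in[j];
}

int main()
{
    const int n = 1 << 24;                 // 64 MB of floats
    const int threads = 1 << 20;
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, threads * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    scatterRead<<<threads / 256, 256>>>(in, out, 32, n);  // warm-up
    cudaEventRecord(t0);
    scatterRead<<<threads / 256, 256>>>(in, out, 32, n);  // stride of 32 floats = 128B
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("scattered read: %.3f ms\n", ms);

    cudaFree(in);
    cudaFree(out);
    return 0;
}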
With the official 3.2 driver and profiler you should also be able to see differences in the L2 read requests. The request counters are incremented by 1 for each 32B segment (and they count for the entire chip, since L2 is a "global" resource, as opposed to per-SM counters). A caching load that misses in L1 will request four 32B segments from L2, incrementing that count by 4. So, if you have a scattered access pattern where each thread fetches a unique L1 line, the L2 read request count for caching loads should be 4x higher than for non-caching loads.
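If recompiling with the -dlcm switch is inconvenient, one way to compare both paths from a single binary is to issue the loads through explicit PTX cache operators; a sketch (the kernel itself is just illustrative, the .ca/.cg operators are the ones defined in the PTX ISA for sm_20):

// Same scattered pattern, but the load is issued through an explicit PTX
// cache operator, so caching (.ca) and non-caching (.cg) runs can be
// profiled from one binary by flipping the useCG argument.
__device__ float loadCA(const float *p)    // cache in L1 and L2 (the default)
{
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));  // "l" assumes a 64-bit build
    return v;
}

__device__ float loadCG(const float *p)    // cache in L2 only, bypass L1
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void scatterReadPTX(const float *in, float *out,
                               int stride, int n, int useCG)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;
    out[i] = useCG ? loadCG(in + j) : loadCA(in + j);
}

With each thread touching its own 128B line, a profiler run with useCG=0 should show roughly 4x the L2 read requests of a run with useCG=1, which is the ratio described above.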
No, I just measure runtime. I recall that the profiler used to report the number of each kind of transaction, but I can't find it anymore. Does it work on Fermi?
Thanks, I see now that I get 4x fewer DRAM reads when using non-cached accesses, so it must indeed use 4x smaller transactions. However, this doesn't lead to a 4x speedup. It seems that performance in scattered access is not bound by pin bandwidth. But what is the bottleneck then?
Could this be the performance of the DRAM chips themselves? Here are the specs for the 1Gb GDDR5 chips from Hynix: http://www.hynix.com/datasheet/eng/graphics/details/graphics_26_H5GQ1H24AFR.jsp?menu1=01&menu2=04&menu3=05&menuNo=1&m=3&s=4&RK=26 . It lists t32AW = 184 ns, which means a chip can't open more than 32 pages in any 184 ns window. In scattered access you have to open one page for every access. There are 12 chips on a GTX 480, so this bounds the aggregate access rate to 32*12 accesses per 184 ns = 2.1 Gaccesses/s. For 64-bit accesses this is 17 GB/s, or 9.4% of pin bandwidth. If correct, this clearly is a bottleneck.
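Just to make the arithmetic explicit, here it is as a small host-side program (the 177.4 GB/s figure in the comment is the GTX 480's theoretical pin bandwidth; everything else comes from the numbers above):

/* Back-of-the-envelope check of the t32AW bound, using the datasheet
   numbers above and the GTX 480 board layout. */
#include <stdio.h>

int main(void)
{
    double t32aw_ns = 184.0;   /* 32-activate window per chip              */
    int    acts     = 32;      /* row activates allowed inside that window */
    int    chips    = 12;      /* 32-bit GDDR5 chips on a 384-bit GTX 480  */
    double bytes    = 8.0;     /* one 64-bit access                        */

    double accesses_per_s = acts * chips / (t32aw_ns * 1e-9);
    double gb_per_s       = accesses_per_s * bytes / 1e9;

    printf("aggregate row activates: %.2f G/s\n", accesses_per_s / 1e9);
    printf("useful bandwidth at 8 B/access: %.1f GB/s\n", gb_per_s);
    /* prints ~2.09 G/s and ~16.7 GB/s, i.e. about 9.4% of the
       177.4 GB/s theoretical pin bandwidth of a GTX 480 */
    return 0;
}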