32B memory transactions on Fermi?

On slide 33 of this presentation - http://www.wasp.uwa.edu.au/home/news?f=282223 - Mark Harris claims that if you use a compiler switch to turn Fermi’s L1 cache off, you can get 32B segment transactions. Could anyone check that it works? I’d be very interested to see a microbenchmark… So far I can see only 128B transactions :(

Vasily
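
A minimal sketch of the kind of microbenchmark in question, assuming a gather kernel where each thread performs one scattered 8-byte load through a precomputed index array (the kernel name, problem size, and harness are illustrative, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical gather microbenchmark: each thread loads one 8-byte element
// through a precomputed index; with random indices, the 32 threads of a warp
// hit up to 32 different 128B lines.
__global__ void gather(const double *src, const int *idx, double *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // scattered 8-byte load
}

int main()
{
    const int n = 1 << 24;                          // 16M elements (arbitrary)
    double *d_src, *d_dst;
    int *d_idx;
    cudaMalloc((void **)&d_src, n * sizeof(double));
    cudaMalloc((void **)&d_dst, n * sizeof(double));
    cudaMalloc((void **)&d_idx, n * sizeof(int));

    int *h_idx = (int *)malloc(n * sizeof(int));    // random scatter pattern
    for (int i = 0; i < n; ++i) h_idx[i] = rand() % n;
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    gather<<<(n + 255) / 256, 256>>>(d_src, d_idx, d_dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("gather: %.3f ms\n", ms);
    return 0;
}

Building this once with caching loads and once with non-caching loads (see the flag below) lets the runtimes and profiler counters be compared directly.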

Are you passing the right flag ( -Xptxas -dlcm=cg ) to nvcc?
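
For reference, the two build variants would look something like this (the file name is illustrative); -dlcm=ca is the default, which caches global loads in L1 with 128B lines, while -dlcm=cg bypasses L1 so global loads use 32B L2 segments:

nvcc -arch=sm_20 -Xptxas -dlcm=ca -o gather_ca gather.cu   # caching loads (default)
nvcc -arch=sm_20 -Xptxas -dlcm=cg -o gather_cg gather.cu   # non-caching loads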

I am.

Vasily,
How do you see these transactions? Profiler?

It works, see slide 26 of http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Analysis_Driven_Optimization.pdf

With the official 3.2 driver and profiler you should also be able to see differences in the L2 read requests. The request counters are incremented by 1 for each 32B segment, and they count for the entire chip, since L2 is a "global" resource rather than a per-SM one. A caching load that misses in L1 requests 4 segments from L2, incrementing the count by 4. So, with a scattered access pattern where each thread fetches a unique L1 line, the L2 read request count for caching loads should be 4x higher than for non-caching loads.
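
A sketch of an access pattern that should show exactly this 4x ratio (the kernel name and stride are illustrative): every thread of a warp reads from a different 128B line, so each caching load misses a whole L1 line while each non-caching load fetches only the 32B segment it actually needs.

// Each thread strides by 32 floats (128 bytes), so the 32 threads of a warp
// touch 32 distinct 128B lines; assumes n is a multiple of 32.
// With -dlcm=ca: 32 L1-line misses per warp -> 32 * 4 = 128 L2 read requests.
// With -dlcm=cg: 32 * 1 = 32 L2 read requests, i.e. 4x fewer.
__global__ void stride128(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;       // 128-byte stride per thread
    dst[i] = src[j];
}

This kernel can be dropped into the same host harness as the gather example above.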

No, I just measure runtime. I recall that the profiler used to report the number of different kinds of transactions, but I can't find that counter anymore. Does it work on Fermi?

Thanks, I see now that I get 4x fewer DRAM reads when using non-caching accesses, so it must indeed use 4x smaller transactions. However, this doesn't lead to a 4x speedup. It seems that scattered-access performance is not bound by pin bandwidth. But what is the bottleneck then?

Could this be the performance of the DRAM chips themselves? Here are the specs for 1Gb GDDR5 chips from Hynix: http://www.hynix.com/datasheet/eng/graphics/details/graphics_26_H5GQ1H24AFR.jsp?menu1=01&menu2=04&menu3=05&menuNo=1&m=3&s=4&RK=26 . The datasheet gives t32AW = 184 ns, which means a chip can't open more than 32 pages per 184 ns. In scattered access you have to open one page for every access. There are 12 chips on a GTX 480, so this bounds the aggregate access rate to 32*12 = 384 accesses per 184 ns ≈ 2.1 Gaccesses/s. For 64-bit accesses this is about 17 GB/s, or 9.4% of pin bandwidth. If correct, this is clearly a bottleneck.

Vasily
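
As a quick back-of-the-envelope check of that bound (the ~177 GB/s figure is the GTX 480's theoretical pin bandwidth; the rest comes from the Hynix numbers above):

#include <cstdio>
int main()
{
    const double t32aw   = 184e-9;              // t32AW window, seconds
    const double acc     = 32.0 * 12 / t32aw;   // 32 activates per window, 12 chips
    const double bytes_s = acc * 8.0;           // 64-bit (8-byte) accesses
    const double pin_bw  = 177.4e9;             // GTX 480 pin bandwidth, bytes/s
    printf("%.2f Gaccesses/s, %.1f GB/s, %.1f%% of pin bandwidth\n",
           acc / 1e9, bytes_s / 1e9, 100.0 * bytes_s / pin_bw);
    // prints roughly: 2.09 Gaccesses/s, 16.7 GB/s, 9.4% of pin bandwidth
    return 0;
}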
