32B memory transactions on Fermi?

On slide 33 of this presentation - http://www.wasp.uwa.edu.au/home/news?f=282223 - Mark Harris claims that if you use a compiler switch to turn Fermi’s L1 cache off, you can get 32B segment transactions. Could anyone check that it works? I’d be very interested to see a microbenchmark… So far I can see only 128B transactions :(

Vasily
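
A minimal sketch of the kind of microbenchmark in question, assuming a gather kernel where each thread performs one scattered 8-byte load through a precomputed index array (the kernel name, problem size, and harness are illustrative, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical gather microbenchmark: each thread loads one 8-byte element
// through a precomputed index; with random indices, the 32 threads of a warp
// hit up to 32 different 128B lines.
__global__ void gather(const double *src, const int *idx, double *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // scattered 8-byte load
}

int main()
{
    const int n = 1 << 24;                          // 16M elements (arbitrary)
    double *d_src, *d_dst;
    int *d_idx;
    cudaMalloc((void **)&d_src, n * sizeof(double));
    cudaMalloc((void **)&d_dst, n * sizeof(double));
    cudaMalloc((void **)&d_idx, n * sizeof(int));

    int *h_idx = (int *)malloc(n * sizeof(int));    // random scatter pattern
    for (int i = 0; i < n; ++i) h_idx[i] = rand() % n;
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    gather<<<(n + 255) / 256, 256>>>(d_src, d_idx, d_dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("gather: %.3f ms\n", ms);
    return 0;
}

Building this once with caching loads and once with non-caching loads (see the flag below) lets the runtimes and profiler counters be compared directly.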

Are you passing the right flag ( -Xptxas -dlcm=cg ) to nvcc?
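
For reference, the two build variants would look something like this (the file name is illustrative); -dlcm=ca is the default, which caches global loads in L1 with 128B lines, while -dlcm=cg bypasses L1 so global loads use 32B L2 segments:

nvcc -arch=sm_20 -Xptxas -dlcm=ca -o gather_ca gather.cu   # caching loads (default)
nvcc -arch=sm_20 -Xptxas -dlcm=cg -o gather_cg gather.cu   # non-caching loads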

I am.

Vasily,
How do you see these transactions? Profiler?

It works, see slide 26 of http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Analysis_Driven_Optimization.pdf

With the official 3.2 driver and profiler you should also be able to see differences in the L2 read requests. The request counters are incremented by 1 for each 32B segment, and they count for the entire chip, since L2 is a "global" resource rather than a per-SM one. A caching load that misses in L1 requests 4 segments from L2, incrementing the count by 4. So, with a scattered access pattern where each thread fetches a unique L1 line, the L2 read request count for caching loads should be 4x higher than for non-caching loads.
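
A sketch of an access pattern that should show exactly this 4x ratio (the kernel name and stride are illustrative): every thread of a warp reads from a different 128B line, so each caching load misses a whole L1 line while each non-caching load fetches only the 32B segment it actually needs.

// Each thread strides by 32 floats (128 bytes), so the 32 threads of a warp
// touch 32 distinct 128B lines; assumes n is a multiple of 32.
// With -dlcm=ca: 32 L1-line misses per warp -> 32 * 4 = 128 L2 read requests.
// With -dlcm=cg: 32 * 1 = 32 L2 read requests, i.e. 4x fewer.
__global__ void stride128(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;       // 128-byte stride per thread
    dst[i] = src[j];
}

This kernel can be dropped into the same host harness as the gather example above.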

No, I just measure runtime. I recall that the profiler used to report the number of different kinds of transactions, but I can't find that counter anymore. Does it work on Fermi?

Thanks, I see now that I get 4x fewer DRAM reads when using non-caching accesses, so it must indeed use 4x smaller transactions. However, this doesn't lead to a 4x speedup. It seems that scattered-access performance is not bound by pin bandwidth. But what is the bottleneck then?

Could this be the performance of the DRAM chips themselves? Here are the specs for 1Gb GDDR5 chips from Hynix: http://www.hynix.com/datasheet/eng/graphics/details/graphics_26_H5GQ1H24AFR.jsp?menu1=01&menu2=04&menu3=05&menuNo=1&m=3&s=4&RK=26 . The datasheet gives t32AW = 184 ns, which means a chip can't open more than 32 pages per 184 ns. In scattered access you have to open one page for every access. There are 12 chips on a GTX 480, so this bounds the aggregate access rate to 32*12 = 384 accesses per 184 ns ≈ 2.1 Gaccesses/s. For 64-bit accesses this is about 17 GB/s, or 9.4% of pin bandwidth. If correct, this is clearly a bottleneck.

Vasily
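
As a quick back-of-the-envelope check of that bound (the ~177 GB/s figure is the GTX 480's theoretical pin bandwidth; the rest comes from the Hynix numbers above):

#include <cstdio>
int main()
{
    const double t32aw   = 184e-9;              // t32AW window, seconds
    const double acc     = 32.0 * 12 / t32aw;   // 32 activates per window, 12 chips
    const double bytes_s = acc * 8.0;           // 64-bit (8-byte) accesses
    const double pin_bw  = 177.4e9;             // GTX 480 pin bandwidth, bytes/s
    printf("%.2f Gaccesses/s, %.1f GB/s, %.1f%% of pin bandwidth\n",
           acc / 1e9, bytes_s / 1e9, 100.0 * bytes_s / pin_bw);
    // prints roughly: 2.09 Gaccesses/s, 16.7 GB/s, 9.4% of pin bandwidth
    return 0;
}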
