Does anyone have an example where they have actually achieved 8-byte shared memory bandwidth on Kepler? I have a kernel that I suspect is limited by shared memory bandwidth, at least in places. Counting all of the shared memory reads and writes and dividing by the kernel execution time, I get an average of 540 GB/s on a GTX 670, which has a theoretical maximum of about 1750 GB/s. Is it worthwhile trying to use 8-byte transactions? My kernel is mostly convolutions with 3 x 3 x 3 separable filters, and it seems quite similar to the “TTI Reverse Time Migration” case study in “GPU Performance Analysis and Optimization” by Paulius Micikevicius.