Would it be a better idea to store both of the input matrices in shared memory, given 1) the increase in shared memory on Fermi and 2) the higher ratio between peak DP performance and memory bandwidth on Fermi compared to pre-Fermi? Any initial findings?
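
To make the trade-off concrete, here is a rough back-of-the-envelope estimate. It is only a sketch under my own assumptions, not a measurement: a square tile of size $T \times T$ (the symbol $T$ is hypothetical), both input tiles staged in shared memory, and double precision at 8 bytes per element. Per tile step, a thread block loads $2T^2$ elements from global memory and performs $2T^3$ floating-point operations:

$$
\text{bytes loaded} = 2T^2 \cdot 8 = 16T^2, \qquad
\text{flops} = 2T^3, \qquad
\frac{\text{flops}}{\text{byte}} = \frac{2T^3}{16T^2} = \frac{T}{8}
$$

$$
\text{shared memory per block} = 2 \cdot T^2 \cdot 8 = 16T^2 \ \text{bytes}
\quad\Rightarrow\quad
T = 16:\ 4\ \text{KB}, \qquad T = 32:\ 16\ \text{KB}
$$

Under these assumptions, doubling $T$ doubles the flops per byte of global traffic but quadruples the shared memory footprint, which is why the larger shared memory on Fermi (up to 48 KB per SM) seems relevant here. The estimate ignores caching, occupancy, and register blocking, so it may not predict actual performance.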