The background is that, setting Fermi’s caches aside for a moment, GPUs traditionally rely on a large number of concurrently executing threads to hide memory access latency. On the previous GT200 architecture, each SM contains 8 cores, so one warp takes 4 cycles to issue a single primitive instruction. Assuming an SM maintains 1024 active threads, i.e., 32 warps, each warp needs 600/4/32 = 4.69 arithmetic instructions per load/store to hide 600 cycles of memory access latency.
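To make the back-of-envelope math explicit, here is a minimal Python sketch of that calculation (the 600-cycle latency and the per-warp issue rate are the assumptions above, not measured numbers):

```python
def insts_per_load(latency_cycles, cycles_per_warp_inst, active_warps):
    """Arithmetic instructions each warp needs per load/store so that
    the resident warps together fill every issue slot during the
    memory latency window."""
    # Issue slots available while one load is in flight, split
    # evenly among the resident warps.
    return latency_cycles / cycles_per_warp_inst / active_warps

# GT200: one warp instruction per 4 cycles, 32 resident warps
print(insts_per_load(600, 4, 32))  # -> 4.6875
```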
Now back to Fermi, which raises the core count per SM from 8 to 32, so one warp instruction issues per cycle (strictly, 2 warps every 2 cycles thanks to the dual schedulers, which comes to the same throughput). One SM maintains 1.5K active threads (48 warps), so each warp now needs 600/48 = 12.5 arithmetic instructions per load/store to hide the same 600 cycles of memory latency (in other words, an application tuned to the old 4.69 ratio would spend roughly 2/3 of its time waiting)… This seems like a step backward for many applications.
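Reusing the helper above (or just inlining the arithmetic), the idle fraction for a GT200-tuned kernel on Fermi works out as follows; again, these are my assumed figures:

```python
gt200 = 600 / 4 / 32   # 4.6875 arithmetic insts per load/store
fermi = 600 / 1 / 48   # 12.5 arithmetic insts per load/store

# Fraction of Fermi issue slots a kernel tuned to the GT200 ratio
# would leave empty while waiting on memory:
print(1 - gt200 / fermi)  # -> 0.625, i.e. idle roughly 2/3 of the time
```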
I also noticed that Fermi adds an L1 cache per SM and a unified L2 cache to address the memory latency issue. The 48KB L1 averages out to 32 bytes per thread across 1.5K threads, and the 768KB unified L2 likewise averages 32 bytes per active thread. Performance could suffer in applications with a lot of cold cache misses (compute bubbles idling through the full 600-cycle latency). Meanwhile, on the old GT200 architecture, those same applications may have had a high enough arithmetic/memory ratio to completely hide those bubbles.
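For reference, the per-thread cache arithmetic; the 16-SM chip-wide thread count is my assumption for a full GF100, since the L2 is shared by all SMs:

```python
KB = 1024
threads_per_sm = 1536
sm_count = 16  # assumed full GF100 configuration

l1_per_thread = 48 * KB / threads_per_sm                # L1 is per-SM
l2_per_thread = 768 * KB / (sm_count * threads_per_sm)  # L2 is chip-wide
print(l1_per_thread, l2_per_thread)  # -> 32.0 32.0 (bytes per thread)
```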
My current feeling is that, while most applications will benefit from the caches and the extra concurrent cores, some may see a negative impact.
I’d like to hear what you think about this ;-)