From what I read in NVIDIA’s white paper, the next-gen GPU architecture is going to increase the number of cores per SM 4x (to 32), but will allow only 48 KB of on-chip cache to be configured as shared memory. That is, shared memory per core goes down? That is somewhat bad news, since some applications’ performance is limited by exactly this factor, which limits the number of warps dispatched to a given SM. Any thoughts on that?
Sure, but for some algorithms the amount of shared memory required is proportional to the number of cores. Say, in algorithms involving finite differences (or reductions), one loads a chunk of an array into shared memory whose size is proportional to the number of cores (plus halo/“shadow” elements), and each core computes a difference between adjacent elements. When the array elements are composite objects, one can easily hit the shared-memory limit and end up with only 1 warp active on an SM because of it (I actually ran into this problem). On Fermi we have more memory per block, but if we want code written for GT200 to work on Fermi, we would have to leave some of the cores idle, since shared memory will not be big enough to hold a sufficiently large chunk of the array: a simple scaling gives 16 KB / 8 = 2 KB per core on GT200, but only 48 KB / 32 = 1.5 KB per core on Fermi.
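To make that concrete, here is a minimal sketch (not my actual code) of the kind of kernel I mean; `diff1d` and `BLOCK_SIZE` are made-up names, and the tile is blockDim.x + 2 floats. If the element type were a larger composite struct instead of a float, the per-block tile would grow proportionally and could hit the 16 KB (GT200) or 48 KB (Fermi) limit, capping the number of resident warps:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128   // must match blockDim.x at launch

// 1-D finite difference: each block stages its chunk of the array plus two
// halo ("shadow") elements in shared memory; shared usage is proportional
// to the block size.
__global__ void diff1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE + 2];        // chunk + left/right halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                    // +1 leaves room for left halo

    if (gid < n) {
        tile[lid] = in[gid];
        if (threadIdx.x == 0)                               // left halo
            tile[0] = (gid > 0) ? in[gid - 1] : in[0];
        if (threadIdx.x == blockDim.x - 1 || gid == n - 1)  // right halo
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[n - 1];
    }
    __syncthreads();

    // Central difference between adjacent elements of the staged tile.
    if (gid < n)
        out[gid] = 0.5f * (tile[lid + 1] - tile[lid - 1]);
}
```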
In practice I find that my shared memory usage is proportional to my block size rather than to the number of cores. (That is to say, I pay little attention to the number of cores in an SM when designing an algorithm.) One thing I am worried about is whether keeping 32 cores per SM busy will require bigger blocks than I currently use. If so, then we run into the problem you note. If a block size of 128 or 192 still runs efficiently on Fermi, then this could be a net win.
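Rough back-of-the-envelope numbers, just to illustrate (the 64 B/thread composite element size is an arbitrary assumption, not from the white paper; only the 16 KB and 48 KB figures are):

```cuda
#include <stdio.h>

// How many blocks fit per SM, shared-memory-wise, if shared usage is
// proportional to block size at a (hypothetical) 64 B per thread.
int main(void)
{
    const int    block_sizes[]   = {128, 192, 256};
    const size_t bytes_per_thread = 64;            // assumed composite element
    const size_t smem_gt200       = 16 * 1024;     // shared memory per SM
    const size_t smem_fermi       = 48 * 1024;

    for (int i = 0; i < 3; ++i) {
        size_t per_block = block_sizes[i] * bytes_per_thread;
        printf("block %3d: %5zu B/block -> %zu block(s)/SM on GT200, %zu on Fermi\n",
               block_sizes[i], per_block,
               smem_gt200 / per_block, smem_fermi / per_block);
    }
    return 0;
}
```

Under that assumption, a 128-thread block needs 8 KB, so GT200 holds 2 such blocks per SM while Fermi holds 6; whether that is enough warps to keep 32 cores busy is exactly the open question.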