10 MB of shared memory

I’ve used CUDA for some time and like it very much. I do feel limited by the small size of the shared memory, though: you can’t do that much with 16 KB. Fermi will be better with 48 KB + 16 KB of L1 cache, but that is still too small. Why is it so hard to get something like 10 MB of shared memory for each SM?

It’s not hard; it’s just probably not very beneficial compared to the amount of die area it would take away from other things. It’s a balance.

How many CPUs do you see with 10 MB of level 1 cache? (Level 1 cache is the fair comparison here.)

Have you ever heard of or seen a microprocessor with 300 MB of on-die L1 cache? Thought not. That is why there is only 16 KB of shared memory and 64 KB of register memory per MP on the current GT200. Memory takes up a lot of die area. The whole GPU philosophy is built around lots of cores, little if any cache, and very fast memory. What you want is almost completely orthogonal to that.
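To put a rough number on the die-area argument, here is a back-of-the-envelope sketch. It assumes a GT200-class chip with 30 multiprocessors (which is where the 300 MB figure comes from), a classic 6-transistor SRAM cell, and roughly 1.4 billion transistors for the whole GT200 die. The constant and function names are made up for illustration, and real SRAM arrays also need decoders and sense amplifiers, so if anything this underestimates the cost.

```python
# Back-of-the-envelope estimate; every constant here is an assumption.
SM_COUNT = 30                  # multiprocessors on a GT200-class chip
TRANSISTORS_PER_SRAM_BIT = 6   # classic 6T cell, ignoring decoders and sense amps
GT200_TRANSISTORS = 1.4e9      # approximate total for the whole GT200 die


def sram_transistors(bytes_per_sm, sm_count=SM_COUNT):
    """Transistors needed for the raw SRAM cells across all multiprocessors."""
    bits = bytes_per_sm * 8 * sm_count
    return bits * TRANSISTORS_PER_SRAM_BIT


requested = sram_transistors(10 * 1024 * 1024)   # 10 MB per SM
current = sram_transistors(16 * 1024)            # 16 KB per SM

print(f"10 MB per SM: {requested / 1e9:.1f} billion transistors")
print(f"16 KB per SM: {current / 1e6:.1f} million transistors")
print(f"ratio to whole GT200 die: {requested / GT200_TRANSISTORS:.1f}x")
```

Under these assumptions the shared memory alone would need several times the transistor budget of the entire existing chip, before a single core is added.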

I agree. However, 16 KB is too small for some numerical applications. Why not at least 64 KB of shared memory and 128 KB of register memory per MP for a GT200-like GPU, or more than 48 KB + 16 KB for Fermi?

I’d rather have more SMs (streaming multiprocessors). Go build your own GPU ;)

I would prefer to be able to use all 64 KB of Fermi’s on-chip memory as shared memory and get rid of the L1 cache altogether.

I would like to have a larger register file. Sometimes one thread uses so many registers that the number of active threads per SM drops below 192, and then performance is bad because the pipeline latency cannot be hidden.

Or the number of active threads per SM drops below 256, which is not enough to hide the latency of shared memory.
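The register-pressure point can be sketched with some simple arithmetic. This assumes 16384 32-bit registers per SM (the 64 KB quoted earlier in the thread), a GT200 limit of 1024 resident threads per SM, and threads scheduled in whole warps of 32. The function name is hypothetical, and the real hardware allocates registers per block with its own granularity, so treat this as an upper bound rather than what the occupancy calculator would report.

```python
# Simplified occupancy estimate; constants are assumptions for a GT200-class SM.
REGISTERS_PER_SM = 16384    # 64 KB of 32-bit registers
MAX_THREADS_PER_SM = 1024   # assumed hardware limit on resident threads
WARP_SIZE = 32


def max_active_threads(regs_per_thread):
    """Upper bound on resident threads per SM when registers are the only limit."""
    by_registers = REGISTERS_PER_SM // regs_per_thread
    whole_warps = (by_registers // WARP_SIZE) * WARP_SIZE  # round down to full warps
    return min(whole_warps, MAX_THREADS_PER_SM)


for regs in (16, 32, 64, 85, 100):
    threads = max_active_threads(regs)
    note = "below 192" if threads < 192 else "ok for pipeline latency"
    print(f"{regs:3d} regs/thread -> {threads:4d} threads ({note})")
```

With this model, a kernel has to stay at or under 85 registers per thread to keep 192 threads resident, and at or under 64 to keep 256.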

Yeah, 64(0)k ought to be enough for anybody.

It isn’t impossible, but expensive.

The obvious solution is of course to use the full time/space continuum, go explore the world for a year, save a whale or two - and then when you are done doing that, come back here and you will be able to buy everything you ever dreamed of for mere pennies.


If we start a democratic voting process on future GPU generations’ feature sets, we might end up with something like Homer Simpson’s car.

I trust NVIDIA to have hit the right balance between price, performance, and power consumption.

Yes, power consumption is a big issue that people don’t seem to discuss much around here…

If NVIDIA wants to take more market share from, say, FPGAs, the GFLOPS/watt ratio needs to become a bit more favorable. For some applications, FPGAs deliver an effective 7-8 GFLOPS/watt (the figure people often look at in that business), which is hard to reach on many graphics cards. The next generation, however, is getting up to 10-18 GFLOPS/watt of theoretical performance, which would mean that you might reach good effective FLOPS/watt ratios.
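The gap between theoretical and effective figures is easy to sketch. The helper below is hypothetical, and the card in the example (1000 GFLOPS peak, 100 W) is a made-up round number, not a measurement of any real product. The `utilization` parameter stands for the fraction of peak throughput an application actually sustains.

```python
def gflops_per_watt(peak_gflops, board_watts, utilization=1.0):
    """Effective GFLOPS/watt, given the fraction of peak actually sustained."""
    if board_watts <= 0:
        raise ValueError("board_watts must be positive")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_gflops * utilization / board_watts


# Illustrative card: 1000 GFLOPS theoretical peak at 100 W (made-up numbers).
for utilization in (1.0, 0.75, 0.5):
    ratio = gflops_per_watt(1000.0, 100.0, utilization)
    print(f"utilization {utilization:.2f}: {ratio:.1f} GFLOPS/watt")
```

So a card sitting at the low end of the theoretical 10-18 range only matches an FPGA at an effective 7-8 if the application sustains most of the peak.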