Not enough shared mem

Hi all!

I have two questions about shared memory.

The CUDA Programming Guide specification says (for my hardware):

  • Max number of threads per multiprocessor = 768
  • 16KB of shared memory per multiprocessor

That works out to 21.33 bytes per thread. (Don’t ask me why I need to think of shared memory per thread.) If my block size is 16x16 (256 threads), then 3 blocks are placed on one multiprocessor, so I fill the multiprocessor with 768 threads → 21.33 bytes per thread. My question is the following: if my block size is 32x16 (512 threads), then no more than one block can be placed on a multiprocessor, and the shared memory will be shared among 512 threads → 32 bytes per thread? Does the nvcc compiler do something to fill the multiprocessor with threads?

The second question is about shared memory “tricks”. I don’t have enough shared memory for my purposes. Does anybody have an idea or trick for getting the maximum benefit out of shared memory when you don’t have enough space? For instance, putting part of the data into global memory and then recovering it by…

Thank you everyone for your time.

If you need more shared memory bytes per thread, then simply don’t run the maximum number of threads on an MP. Run a block of 64 threads and now you have 256 bytes/thread.

Shared memory isn’t assigned per thread… it’s shared among the whole block.

When you launch your kernel, you tell it how many threads and how much dynamic shared memory you want each block to have.
Register use and static shared memory are built into the kernel.
The driver then decides how many blocks can simultaneously run on a single SM. This may be limited by thread count, by register use, or by shared memory use.
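As a sketch of how the dynamic part is requested at launch (the kernel and buffer names here are made up for illustration): the byte count goes in the third launch parameter, and the kernel sees it through an unsized `extern __shared__` array.

```cuda
// Dynamic shared memory: sized by the third <<<...>>> launch parameter,
// visible inside the kernel as an unsized extern array.
extern __shared__ float smem[];

__global__ void my_kernel(float *out)   // hypothetical kernel
{
    smem[threadIdx.x] = (float)threadIdx.x;  // each thread fills one slot
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

// Host side: 1 block of 256 threads, 256 floats of dynamic shared memory.
// my_kernel<<<1, 256, 256 * sizeof(float)>>>(d_out);
```

Static shared memory (a fixed-size `__shared__ float buf[256];` declared in the kernel) is counted at compile time instead; the driver adds both together when deciding how many blocks fit on an SM.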

So, answering your question: if you have 512 threads per block on your SM 1.0/1.1 hardware (max 768 threads per SM), you can only run one block per SM, regardless of shared memory use.

When a block is launched, it gets exactly how much shared memory you specified for it. You don’t get “extra” shared memory because the driver had some left over to pass around. If you really wanted to optimize this, you could do it yourself, likely starting with the occupancy tool.

Remember also that you don’t get all 16K of shared memory. A few words are used for built-in variables like threadIdx, and more are used for kernel arguments.

Still not enough memory, but thanks anyway.

I know it’s not per thread, but I need each position in my grid (each thread) to store a vector of k elements. Each thread accesses one position in shared memory. That’s why I say “shared memory per thread”.

Very nice tool. Thanks.

I didn’t know. Thanks.

Then use global and/or local memory, or wait for Fermi, which has 64K split between L1 cache and shared memory. If you write your app with global/local memory now (likely with slow performance), it will automatically benefit from the L1 cache on Fermi without any code changes.
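A sketch of that global-memory fallback for the per-thread k-vector case described earlier (the kernel name and layout are illustrative, not from this thread): preallocate one global scratch buffer and give each thread its own k-element slice of it.

```cuda
// Each thread owns k consecutive elements of a global scratch buffer.
// Note: a strided layout (element i of thread tid at scratch[i * nthreads + tid])
// would coalesce better; the contiguous layout is shown for clarity.
__global__ void per_thread_vectors(float *scratch, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *vec = scratch + tid * k;   // this thread's private k-vector

    for (int i = 0; i < k; ++i)
        vec[i] = 0.0f;                // use vec[] as you would a shared array
}

// Host side: allocate one k-vector per launched thread.
// cudaMalloc((void **)&scratch, nblocks * nthreads_per_block * k * sizeof(float));
```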