Not enough shared mem

Hi all!

I have two questions about shared memory.

The CUDA Programming Guide specification says (for my hardware):

  • Max number of threads per multiprocessor = 768
  • 16KB of shared memory per multiprocessor

That works out to 21.33 bytes per thread. (Don’t ask me why I need to think of shared memory per thread.) If my block size is 16x16 (256 threads), then 3 blocks are placed on one multiprocessor, so I fill the multiprocessor with 768 threads → 21.33 bytes per thread. My question is the following: if my block size is 32x16 (512 threads), then no more than one block can be placed on a multiprocessor, and the shared memory will be shared among 512 threads → 32 bytes per thread? Does the nvcc compiler do something to fill the multiprocessor with threads?

The second question is about shared memory “tricks”. I don’t have enough shared memory for my purposes. Does anybody have an idea or trick for getting the maximum benefit out of shared memory when you don’t have enough space? For instance, putting part of the data into global memory and then recovering it by…

Thank you everyone for your time.

If you need more shared memory bytes per thread, then simply don’t run the maximum number of threads on an MP. Run a block of 64 threads and now you have 256 bytes/thread.

Shared memory isn’t assigned per thread… it’s shared among the whole block.

When you launch your kernel, you tell it how many threads and how much dynamic shared memory you want each block to have.
Register use and static shared memory are built into the kernel.
The driver then decides how many blocks can simultaneously run on a single SM. This may be limited by thread count, by register use, or by shared memory use.
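As a sketch of how the dynamic part is requested at launch (the kernel and buffer names here are made up for illustration): the byte count goes in the third launch parameter, and the kernel sees it through an unsized `extern __shared__` array.

```cuda
// Dynamic shared memory: sized by the third <<<...>>> launch parameter,
// visible inside the kernel as an unsized extern array.
extern __shared__ float smem[];

__global__ void my_kernel(float *out)   // hypothetical kernel
{
    smem[threadIdx.x] = (float)threadIdx.x;  // each thread fills one slot
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

// Host side: 1 block of 256 threads, 256 floats of dynamic shared memory.
// my_kernel<<<1, 256, 256 * sizeof(float)>>>(d_out);
```

Static shared memory (a fixed-size `__shared__ float buf[256];` declared in the kernel) is counted at compile time instead; the driver adds both together when deciding how many blocks fit on an SM.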

So, answering your question: if you have 512 threads per block on your SM 1.0/1.1 hardware (max 768 threads per SM), you can only run one block per SM, regardless of shared memory use.

When a block is launched, it gets exactly how much shared memory you specified for it. You don’t get “extra” shared memory because the driver had some left over to pass around. If you really wanted to optimize this, you could do it yourself, likely starting with the occupancy tool.

Remember also that you don’t get all 16K of shared memory. A few words are used for built-in variables like threadIdx, and more are used for kernel arguments.

Still not enough memory, but thanks anyway.

I know it’s not per thread, but I need each position in my grid (each thread) to store a vector of k elements. Each thread accesses one position in shared memory. That’s why I say “shared memory per thread”.

Very nice tool. Thanks.

I didn’t know. Thanks.

Then use global and/or local memory, or wait for Fermi, which has 64K split between L1 cache and shared memory. If you write your app with global/local memory now (likely with slow performance), it will automatically benefit from the L1 cache on Fermi without any code changes.
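A sketch of that global-memory fallback for the per-thread k-vector case described earlier (the kernel name and layout are illustrative, not from this thread): preallocate one global scratch buffer and give each thread its own k-element slice of it.

```cuda
// Each thread owns k consecutive elements of a global scratch buffer.
// Note: a strided layout (element i of thread tid at scratch[i * nthreads + tid])
// would coalesce better; the contiguous layout is shown for clarity.
__global__ void per_thread_vectors(float *scratch, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *vec = scratch + tid * k;   // this thread's private k-vector

    for (int i = 0; i < k; ++i)
        vec[i] = 0.0f;                // use vec[] as you would a shared array
}

// Host side: allocate one k-vector per launched thread.
// cudaMalloc((void **)&scratch, nblocks * nthreads_per_block * k * sizeof(float));
```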