Optimisation Strategies when running out of shared memory

Hey guys,

So what is my best bet if I cannot fully utilise my GPU because I don’t have enough shared memory to launch enough threads per block? Say each task needs 2 KB of shared memory; with 16 KB available per block and each thread taking care of one task, that means I can run at most 8 threads per block.
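For concreteness, here is roughly what my kernel looks like (the names below are placeholders, not my real code): one thread per task, each thread with its own 2 KB slab of shared memory, so a 16 KB block only fits 8 threads.

```cpp
// Hypothetical sketch of the setup described above (kernel name and data
// layout are made up): one thread per task, each with a private 2 KB slab
// of shared memory, so a 16 KB block holds only 8 threads.
#include <cuda_runtime.h>

__global__ void taskKernel(const float* input, float* output, int numTasks)
{
    // 8 threads * 2048 bytes = 16 KB of shared memory per block
    __shared__ unsigned char scratch[8][2048];

    int taskId = blockIdx.x * blockDim.x + threadIdx.x;
    if (taskId >= numTasks) return;

    unsigned char* myScratch = scratch[threadIdx.x];
    // ... each thread works alone on its task using its 2 KB slab ...
    (void)input; (void)output; (void)myScratch;
}

// Launched with only 8 threads per block, which leaves the SM badly underpopulated:
// taskKernel<<<(numTasks + 7) / 8, 8>>>(d_input, d_output, numTasks);
```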

There’s no way to answer that without breaking down your problem to understand what it’s doing and how it could be restructured.

But an immediate observation: if each thread needs 2 KB of shared memory, your approach doesn’t map well to the GPU at all. You want not 8 threads per block (or per SM) but more like 256, or preferably 512, and then you want to run hundreds of such blocks.

Alternatively, take one of your thread’s tasks (and its 2 KB of data) and figure out how you might use at least 32 threads to solve that one task in parallel. Then each warp of a block can work on what you currently have a single thread working on, as sketched below. But this gets back into the whole “it depends what your problem is” conclusion.
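A rough sketch of that warp-per-task layout, purely illustrative (the kernel name, the 512-float task size and the cooperative strided loops are my assumptions, not your code):

```cpp
// Sketch only: one warp per task instead of one thread per task.
// A block of 256 threads = 8 warps, each warp sharing one 2 KB slab,
// so the block still fits the 16 KB shared-memory budget.
#include <cuda_runtime.h>

constexpr int WARP_SIZE       = 32;
constexpr int WARPS_PER_BLOCK = 8;                    // 256 threads per block
constexpr int TASK_ELEMS      = 2048 / sizeof(float); // 2 KB per task = 512 floats

__global__ void warpPerTaskKernel(const float* input, float* output, int numTasks)
{
    __shared__ float slab[WARPS_PER_BLOCK][TASK_ELEMS];

    int warpInBlock = threadIdx.x / WARP_SIZE;
    int lane        = threadIdx.x % WARP_SIZE;
    int taskId      = blockIdx.x * WARPS_PER_BLOCK + warpInBlock;
    if (taskId >= numTasks) return;

    float* myTask = slab[warpInBlock];

    // Cooperative load: the 32 lanes stride through the task's 2 KB of data.
    for (int i = lane; i < TASK_ELEMS; i += WARP_SIZE)
        myTask[i] = input[taskId * TASK_ELEMS + i];
    __syncwarp();

    // ... the 32 lanes now work on the task together (strided loops,
    //     warp shuffles, etc.) instead of one thread doing it all serially ...

    for (int i = lane; i < TASK_ELEMS; i += WARP_SIZE)
        output[taskId * TASK_ELEMS + i] = myTask[i];
}

// Hypothetical launch: 256 threads per block, 8 tasks per block.
// int blocks = (numTasks + WARPS_PER_BLOCK - 1) / WARPS_PER_BLOCK;
// warpPerTaskKernel<<<blocks, WARPS_PER_BLOCK * WARP_SIZE>>>(d_input, d_output, numTasks);
```

With 8 warps per block you are back up to 256 threads per block while still spending exactly the same 16 KB of shared memory; whether the 32 lanes can actually cooperate usefully on one task is the “it depends on your problem” part.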