I have two questions about shared memory.
The CUDA Programming Guide says (for my hardware):
- Max number of threads per multiprocessor = 768
- 16KB of shared memory per multiprocessor
That works out to 21.33 bytes per thread. (Don't ask me why I want to think of shared memory per thread.) If my block size is 16x16 (256 threads), then 3 blocks fit into one multiprocessor, so I fill the multiprocessor with 768 threads --> 21.33 bytes per thread. My question is the following: if my block size is 32x16 (512 threads), then no more than one block can be placed on one multiprocessor, and the shared memory is divided among only 512 threads --> 32 bytes per thread? Does the nvcc compiler do anything to fill the multiprocessor with threads?
The second question is about shared memory “tricks”. I do not have enough shared memory for my purposes. Does anybody have an idea or trick for getting the maximum benefit out of shared memory when there is not enough space? For instance, put a partial result into global memory and then recover it by…
Thank you everyone for your time.