I wrote a kernel that uses shared memory, declared as
volatile __shared__ unsigned int shared[];
I found that if the size is changed to 2048, performance becomes much worse. Nothing else changes and the input data is the same.
I then went to the histogram256 SDK example and modified the source from
volatile __shared__ unsigned int shared[BLOCK_MEM];
to
volatile __shared__ unsigned int shared[BLOCK_MEM*2];
and it is also slower than before.
Can someone tell me the reason, or point me to where this is explained in the Programming Guide?
The most likely reason is that the larger shared memory allocation means fewer blocks can be resident on a multiprocessor simultaneously. Having multiple resident blocks is usually more efficient, since one stalled block (all of its warps waiting on high-latency reads) won't stall the SPs themselves: they have other blocks to work on. The CUDA architecture makes this kind of interleaved scheduling literally free.
If one block uses so much shared memory that only that single block fits, then you lose some of that flexibility, especially if your code uses __syncthreads() a lot. (__syncthreads() is the main reason a whole block stalls.)
Also, when experimenting with the shared memory size, you're not compensating by, e.g., using more threads per block.