Shared memory problems: does the size of allocated shared memory affect execution time?

Has anyone else seen kernel execution time increase simply from enlarging a float array that resides in shared memory?

My video card is a GTX 465. When the shared memory allocated is 16 KB or less, everything runs well. However, when it exceeds 16 KB but is still below the total amount of shared memory available, the kernel's execution time increases dramatically, even though nothing else in the kernel has changed. This confuses me, because it would mean that far less shared memory is usable in practice than there should be.

P.S. On the GTX 465, the total amount of shared memory per block is 48 KB.
My project is compiled under VS2008 with CUDA 4.0 on Windows XP (32-bit).

The strange problem only appears (at least in my experience so far) when the kernel loads data from global memory into shared memory under the conditions described above.

Hi,

This could be the penalty for the decreased occupancy you get.

Suppose that you have 512 threads per block. This means that you could run 3 blocks (48 warps) per multiprocessor. If you increase the shared memory above 16 KB per block, three blocks no longer fit in the 48 KB available on a multiprocessor, so you would not be able to run more than 2 blocks (32 warps) per multiprocessor.
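The arithmetic behind those numbers can be sketched in a few lines of host-side C++. This is only a rough model, not what the Occupancy Calculator actually does: it assumes the compute capability 2.0 per-multiprocessor limits (1536 threads, 8 blocks, 48 KB of shared memory, 32 threads per warp) and ignores register usage and shared memory allocation granularity. The names `resident_blocks` and `resident_warps` are made up for this illustration.

```cpp
// Assumed per-multiprocessor limits for compute capability 2.0 (e.g. GTX 465).
constexpr int kMaxThreadsPerSm  = 1536;
constexpr int kMaxBlocksPerSm   = 8;
constexpr int kSharedBytesPerSm = 48 * 1024;
constexpr int kWarpSize         = 32;

constexpr int min_int(int a, int b) { return a < b ? a : b; }

// Blocks that can be resident on one multiprocessor at the same time,
// limited by block slots, thread count and shared memory per block.
// Simplification: registers and allocation granularity are not modelled.
constexpr int resident_blocks(int threads_per_block, int shared_bytes_per_block) {
    return min_int(kMaxBlocksPerSm,
           min_int(kMaxThreadsPerSm / threads_per_block,
                   shared_bytes_per_block > 0
                       ? kSharedBytesPerSm / shared_bytes_per_block
                       : kMaxBlocksPerSm));
}

// Resident warps = resident blocks * warps per block (rounded up).
constexpr int resident_warps(int threads_per_block, int shared_bytes_per_block) {
    return resident_blocks(threads_per_block, shared_bytes_per_block) *
           ((threads_per_block + kWarpSize - 1) / kWarpSize);
}

// 512 threads per block with exactly 16 KB of shared memory: 3 blocks, 48 warps.
static_assert(resident_warps(512, 16 * 1024) == 48, "3 blocks fit");
// One extra float (4 bytes) pushes the block over 16 KB: 2 blocks, 32 warps.
static_assert(resident_warps(512, 16 * 1024 + 4) == 32, "only 2 blocks fit");
```

Under the same assumptions, going above 24 KB per block would leave room for only one resident block (16 warps), so the drop happens in steps rather than gradually.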

Use the CUDA Occupancy Calculator to work out the occupancy per multiprocessor for your block size and shared memory usage.