Has anyone noticed that kernel execution time increases when you simply enlarge a float array residing in shared memory?
My video card is a GTX 465. When the amount of shared memory allocated is less than or equal to 16KB, everything runs fine. However, when it exceeds 16KB but is still below the total amount of shared memory available, the kernel's execution time increases dramatically, even though nothing else in the kernel has changed. I am confused by this, because it would mean that far less shared memory is usable in practice than there should be.
PS: On the GTX 465, the total amount of shared memory per block is 48KB.
My project is compiled under VS2008 with CUDA 4.0 on Windows XP (32-bit).
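One thing that may be worth ruling out is occupancy, i.e. how many blocks can be resident on a multiprocessor at once. The rough sketch below is only a possible explanation, not a confirmed diagnosis. It assumes the compute capability 2.x limits of 48KB of shared memory and at most 8 resident blocks per multiprocessor, ignores allocation granularity as well as register and thread limits, and uses a hypothetical helper named `blocksPerSM` that is not part of the CUDA API.

```cpp
#include <cstddef>

// Assumed limits for a compute capability 2.x device such as the GTX 465.
const std::size_t kSharedPerSM = 48 * 1024; // shared memory per multiprocessor, in bytes
const int kMaxBlocksPerSM = 8;              // cap on resident blocks per multiprocessor

// Hypothetical helper: upper bound on the number of blocks that can be
// resident on one multiprocessor when shared memory is the only limit.
int blocksPerSM(std::size_t sharedPerBlock) {
    if (sharedPerBlock == 0) {
        return kMaxBlocksPerSM;             // shared memory imposes no limit
    }
    if (sharedPerBlock > kSharedPerSM) {
        return 0;                           // a single block would not fit
    }
    std::size_t fit = kSharedPerSM / sharedPerBlock; // integer division rounds down
    if (fit > static_cast<std::size_t>(kMaxBlocksPerSM)) {
        return kMaxBlocksPerSM;
    }
    return static_cast<int>(fit);
}
```

Under these assumptions, `blocksPerSM(16 * 1024)` gives 3, while `blocksPerSM(16 * 1024 + 4)` gives 2 and `blocksPerSM(32 * 1024)` gives 1. If the kernel depends on several resident blocks per multiprocessor to hide latency, crossing the 16KB mark could therefore slow it down even though the kernel code itself is unchanged.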