We are seeing a performance issue on two different 8800GT GPUs that depends on whether shared memory is allocated statically or dynamically. We changed our implementation from static to dynamic shared memory allocation, and the performance changed dramatically on both GPUs, but in opposite directions.
Here is how the performance changed on the different GPUs:
GPU1: from 5.9 Gbit/sec to 4 Gbit/sec
GPU2: from 3.5 Gbit/sec to 5.1 Gbit/sec
We do not understand how dynamic allocation can change the performance so drastically. Note that GPU1 has 512 MB of RAM, a GPU clock of 660 MHz, and a memory clock of 950 MHz, while GPU2 has 256 MB of RAM, a GPU clock of 600 MHz, and a memory clock of 700 MHz.
Apart from the shared memory declaration, the kernels themselves have not been altered.
Here is the code we have changed:
__shared__ uint32_t state[4*NUM_THREADS]; ----> extern __shared__ uint32_t state[];
kernel<<<blocks, num_threads>>> ----> kernel<<<blocks, num_threads, 16*num_threads>>>
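
To make the change concrete, the two variants could be compared with something like the minimal sketch below. It is not the actual implementation: the block size of 128, the placeholder kernel body `mix_state`, and the timing helper `time_variant` are all assumptions made for illustration. The only parts taken from the change above are the two `state` declarations and the third launch argument, which is a size in bytes (4 words of 4 bytes per thread, hence 16*num_threads).

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 128  // assumed block size; the post does not state it

// Hypothetical stand-in for the real kernel body: each thread fills its
// four 32-bit state words in shared memory and returns their XOR.
__device__ uint32_t mix_state(uint32_t *state, uint32_t seed)
{
    uint32_t *s = &state[4 * threadIdx.x];
    for (int i = 0; i < 4; ++i)
        s[i] = seed * 2654435761u + i;
    return s[0] ^ s[1] ^ s[2] ^ s[3];
}

__global__ void kernel_static(uint32_t *out)
{
    __shared__ uint32_t state[4 * NUM_THREADS];  // size fixed at compile time
    uint32_t gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = mix_state(state, gid);
}

__global__ void kernel_dynamic(uint32_t *out)
{
    extern __shared__ uint32_t state[];  // size supplied at launch, in bytes
    uint32_t gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = mix_state(state, gid);
}

// Launches one variant and returns the elapsed time in milliseconds,
// or -1 if the launch reported an error. d_out must hold at least
// blocks * NUM_THREADS elements.
float time_variant(bool dynamic, int blocks, uint32_t *d_out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // 4 words per thread * 4 bytes per word == 16 * NUM_THREADS bytes
    size_t shmem_bytes = 4 * NUM_THREADS * sizeof(uint32_t);

    cudaEventRecord(start);
    if (dynamic)
        kernel_dynamic<<<blocks, NUM_THREADS, shmem_bytes>>>(d_out);
    else
        kernel_static<<<blocks, NUM_THREADS>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling `time_variant(false, blocks, d_out)` and `time_variant(true, blocks, d_out)` on each card, with `d_out` allocated through `cudaMalloc`, would show whether the difference reproduces outside the full application. Comparing the output of `nvcc --ptxas-options=-v` for the two builds may also be worthwhile, since it reports register and shared memory usage per kernel.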