Static vs dynamic shared memory allocation, performance

We are seeing a performance difference between static and dynamic shared memory allocation on two different 8800GT GPUs. When we changed our implementation from static to dynamic allocation, the performance changed dramatically on both GPUs, in opposite directions.

Here is how the performance changed on the different GPUs:

GPU1: from 5.9Gbit/sec to 4Gbit/sec

GPU2: from 3.5Gbit/sec to 5.1Gbit/sec

We do not understand how dynamic allocation can change the performance so drastically. It should be mentioned that GPU1 has 512MB RAM, a GPU clock of 660MHz and a memory clock of 950MHz, while GPU2 has 256MB RAM, a GPU clock of 600MHz and a memory clock of 700MHz.

The kernels have not been altered in the implementations.

Here is the code we have changed:

__shared__ uint32_t state[4*NUM_THREADS]; ----> extern __shared__  uint32_t state[];

kernel <<<blocks, num_threads>>>  ---->  kernel<<<blocks, num_threads,16*num_threads>>>
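
For anyone who wants to reproduce the comparison, here is a minimal sketch of the two variants side by side, timed with CUDA events. It is not the original code: the kernel names (kernel_static, kernel_dynamic), the kernel body, the block size of 128 and the grid size of 64 are all placeholders chosen for illustration. Only the two declarations and the third launch parameter mirror the change shown above.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define NUM_THREADS 128  // assumed block size, not taken from the original code

// Static variant: the shared array size is fixed at compile time.
__global__ void kernel_static(uint32_t *out)
{
    __shared__ uint32_t state[4 * NUM_THREADS];
    for (int i = 0; i < 4; ++i)
        state[i * blockDim.x + threadIdx.x] = threadIdx.x + i;  // placeholder work
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = state[3 * blockDim.x + threadIdx.x];
}

// Dynamic variant: the size comes from the third launch parameter (in bytes).
__global__ void kernel_dynamic(uint32_t *out)
{
    extern __shared__ uint32_t state[];
    for (int i = 0; i < 4; ++i)
        state[i * blockDim.x + threadIdx.x] = threadIdx.x + i;  // same placeholder work
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = state[3 * blockDim.x + threadIdx.x];
}

int main()
{
    const int blocks = 64;  // assumed grid size
    uint32_t *d_out = nullptr;
    cudaMalloc(&d_out, blocks * NUM_THREADS * sizeof(uint32_t));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_static = 0.0f, ms_dynamic = 0.0f;

    // Time the static version.
    cudaEventRecord(start);
    kernel_static<<<blocks, NUM_THREADS>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_static, start, stop);

    // Time the dynamic version: 4 words per thread * 4 bytes = 16 * NUM_THREADS bytes.
    cudaEventRecord(start);
    kernel_dynamic<<<blocks, NUM_THREADS, 4 * NUM_THREADS * sizeof(uint32_t)>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_dynamic, start, stop);

    printf("static: %f ms, dynamic: %f ms\n", ms_static, ms_dynamic);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

A single launch of such a small kernel is dominated by launch overhead, so a real measurement would probably need a warm-up launch and an average over many iterations.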

There should not be any performance difference between static and dynamic allocation of shared memory. Can you post a simple code example that demonstrates the problem? Thanks!

I have sent you a PM with a link to the code, as it's a bit too big to paste here. You can summarize your findings here if you want.

It turns out this issue was caused by the fact that GPU1 was running on 32-bit Linux and GPU2 on 64-bit Linux.

However, the static implementation is still slower on both, most likely because shared memory allocation is handled better at runtime than by the compiler?