GTX285 slower than 8800GTX - what am I doing wrong?

The CUDA kernel of an application that I originally developed on an 8800GTX runs roughly three times slower on a new GTX285 card.

There are a total of 7776 threads, and I’ve tried various block sizes without significant differences in execution time. The results are shown in the table below (sorry, the table lists block count instead of block size, but many of the sizes result in block counts that are multiples of 32).
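A minimal sketch of how a block size maps to a block count for 7776 threads, assuming a 1D launch that rounds the block count up. The helper names `blocksFor`, `idleThreads`, and `isWarpMultiple` are hypothetical and not taken from the application:

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover totalThreads
// when each block holds blockSize threads (rounds up).
inline int blocksFor(int totalThreads, int blockSize) {
    assert(totalThreads > 0 && blockSize > 0);
    return (totalThreads + blockSize - 1) / blockSize;
}

// Hypothetical helper: threads in the last block that have no work
// and would need to be masked out by a bounds check in the kernel.
inline int idleThreads(int totalThreads, int blockSize) {
    return blocksFor(totalThreads, blockSize) * blockSize - totalThreads;
}

// A warp is 32 threads on both the 8800GTX and the GTX285.
inline bool isWarpMultiple(int blockSize) {
    return blockSize % 32 == 0;
}
```

For instance, `blocksFor(7776, 243)` gives 32 blocks, but 243 is not a multiple of the 32-thread warp size, so the last warp of each block is only partly filled. A block size of 256 is a warp multiple but gives 31 blocks, with 160 idle threads in the last one.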

Does anyone have any idea why this application would run slower on a GTX285? Please note that the kernel runs on a second GTX285, not the one driving the display. Previously, the 8800GTX was the only card in the machine. Also, the original machine had a Q6600, while the new machine has a Core i7-920. I’m now using Fedora 10 x64 (gcc 4.3.2) and CUDA 2.2 instead of Windows XP x64 (Visual Studio 2003) and CUDA 2.0 or earlier.

Thank you

Block count    Execution time
32             72.471441
35             71.053589
40             67.735944
48             65.366666
61             62.791828
64             65.309020
81             65.528513
96             65.602999
121            64.484207
128            65.788350
160            64.764079
192            64.640170
224            63.845980
242            90.749312
256            87.389512
484            75.329755
968            69.463448
1936           64.760989
3872           66.295171
7776           69.168400