From the source code, it looks to me like the problem is shared memory bank conflicts.
Fermi-based GPUs have 32 shared memory banks, while earlier GPUs have 16. The program was clearly written with 16 banks in mind. I tested it with Visual Profiler and it does indeed show a serious number of bank conflicts.
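To illustrate why a 16-wide block can conflict on 32 banks, here is a rough host-side sketch in plain C. It is not taken from the SDK sample: the function name, the access pattern (thread (tx, ty) reads word ty * pitch + tx) and the 160-word row pitch are all assumptions made for illustration. It relies only on the usual rule that consecutive 32-bit words map to consecutive banks, with requests issued per half-warp (16 threads) on pre-Fermi GPUs and per full warp (32 threads) on Fermi.

```c
#include <assert.h>

#define MAX_BANKS 32

/* Hypothetical helper (not part of the SDK sample): returns the largest
 * number of threads in one shared-memory request that land in the same bank.
 * Assumed access pattern: thread (tx, ty) reads 32-bit word
 * ty * row_pitch_words + tx, and word w lives in bank w % num_banks. */
static int max_bank_conflict_degree(int num_banks, int threads_per_request,
                                    int block_dim_x, int row_pitch_words)
{
    int hits[MAX_BANKS] = {0};
    int worst = 0;
    int t;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(block_dim_x > 0);

    for (t = 0; t < threads_per_request; ++t) {
        int tx = t % block_dim_x;            /* threadIdx.x of this thread */
        int ty = t / block_dim_x;            /* threadIdx.y of this thread */
        int word = ty * row_pitch_words + tx;
        int bank = word % num_banks;

        hits[bank] += 1;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under these assumptions, max_bank_conflict_degree(16, 16, 16, 160) returns 1 (a half-warp on 16 banks is conflict-free), while max_bank_conflict_degree(32, 32, 16, 160) returns 2, because the two 16-thread rows of a warp fall on the same 16 banks. With a 32-wide row, max_bank_conflict_degree(32, 32, 32, 320) is back to 1. The real kernels may index shared memory differently, so treat this only as a model of the effect.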
Fortunately, it’s easy to correct this because Fermi supports larger blocks. Just modify the two defines in the code:
#define ROWS_BLOCKDIM_X 16 // change this to 32
and
#define COLUMNS_BLOCKDIM_X 16 // change this to 32
This is for the shared memory version. I can get around 3400 Mpix/s on my factory-overclocked 460 (not by much, only 715 MHz).
I don’t know why the texture version is slower, though, as the GeForce 460 should have a slightly higher texture fill rate than the 9800 GT.