Some code I was working on got a massive performance boost on a K20 (Windows 7) from using restrict on 32-bit pointers, but on Linux (Ubuntu) with a lesser GPU (GTX 660) there was only a modest performance boost.
Basically, I had a number of kernels that each took three const input pointers and a single output pointer. I qualified all of them as restrict, which gave as much as a 3x speedup for larger problems, but only on the K20c. On the GTX 660 it was at best 10% faster.
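For reference, the kernels have roughly the shape sketched below. This is a simplified illustration and not the actual code: the kernel name `combine`, the arithmetic in its body, the problem size, and the use of managed memory are all placeholders chosen to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: three read-only inputs and one output,
// all qualified with __restrict__ (CUDA's spelling of restrict).
__global__ void combine(const float* __restrict__ a,
                        const float* __restrict__ b,
                        const float* __restrict__ c,
                        float* __restrict__ out,
                        int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] * b[i] + c[i];  // placeholder arithmetic
    }
}

int main()
{
    const int n = 1 << 20;  // placeholder problem size
    const size_t bytes = n * sizeof(float);

    // Managed memory is used only to keep the sketch short.
    float *a, *b, *c, *out;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    cudaMallocManaged(&out, bytes);

    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c[i] = 3.0f;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    combine<<<blocks, threads>>>(a, b, c, out, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("out[0] = %f\n", out[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    cudaFree(out);
    return 0;
}
```

The only difference between the fast and slow versions is the `__restrict__` qualifier on the four pointer parameters; the kernel bodies and launch configuration are unchanged.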
I know the K20 has a 320-bit memory bus, while the GTX 660, I believe, has a 192-bit bus.
In general the budget GTX 660 has been running very fast in Ubuntu, but it was somehow much less affected by this change.
Does anybody have any insight into this issue?