__restrict__, differing results

Some code I was working on got a massive performance boost on a K20c (Windows 7) from using __restrict__ on 32-bit pointers, but on Linux (Ubuntu) with a lesser GPU (GTX 660) there was only a modest performance boost.

Basically, I had a number of kernels that each took three const input pointers and a single output pointer. I qualified all of them with __restrict__, and that gave as much as a 3x speedup for larger problems, but only on the K20c. On the GTX 660 the gain was 10% at best.
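For reference, a minimal sketch of the kind of kernel signature described above. The kernel name `combine`, the arithmetic in its body, and the host helper `runCombine` are placeholders for illustration, not the actual code from this post.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: three read-only inputs and one output, all marked
// __restrict__ to promise the compiler that the buffers do not overlap.
__global__ void combine(const float* __restrict__ a,
                        const float* __restrict__ b,
                        const float* __restrict__ c,
                        float* __restrict__ out,
                        int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] * b[i] + c[i];  // placeholder arithmetic
    }
}

// Hypothetical host helper: copies inputs to the device, launches the
// kernel, and copies the result back. Error handling is kept minimal.
cudaError_t runCombine(const float* hA, const float* hB, const float* hC,
                       float* hOut, int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dOut, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    combine<<<blocks, threads>>>(dA, dB, dC, dOut, n);

    // This blocking copy also reports errors from the launch above.
    cudaError_t err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dOut);
    return err;
}
```

The qualifiers only help if the promise holds: if any input actually overlaps the output, the results are undefined.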

I know the K20c has a 320-bit memory bus, while I believe the GTX 660 has a 192-bit one.

In general the budget GTX 660 has been running very fast in Ubuntu, but somehow it was much less affected by this change.

Does anybody have any insight into this issue?

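The “ld.global.nc” opcode is only supported on sm_35 or higher devices. With const and __restrict__ on the input pointers, the compiler can route those loads through the read-only data cache using that instruction. The K20c is an sm_35 part, while the GTX 660 is sm_30, so it cannot get the same benefit.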
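As a rough illustration, the same load can be requested explicitly with the `__ldg()` intrinsic, guarded so the code still builds for older architectures. The names `loadReadOnly`, `scaleKernel`, and `runScale` are made up for this sketch.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: read through the read-only data cache where the
// target architecture supports it, otherwise fall back to a plain load.
__device__ __forceinline__ float loadReadOnly(const float* p)
{
#if __CUDA_ARCH__ >= 350
    return __ldg(p);  // read-only cache load, available on sm_35 and higher
#else
    return *p;        // ordinary global load on older architectures
#endif
}

// Hypothetical kernel using the helper for its read-only input.
__global__ void scaleKernel(const float* in, float* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = s * loadReadOnly(in + i);
    }
}

// Hypothetical host helper; error handling is kept minimal.
cudaError_t runScale(const float* hIn, float* hOut, float s, int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(dIn, dOut, s, n);

    // This blocking copy also reports errors from the launch above.
    cudaError_t err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

Dumping the PTX or SASS for each target architecture should show whether your own kernels actually end up with the non-coherent load.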

But it might be worthwhile investigating whether you can use linear memory bound to a 1D texture in your case, which would give the GTX 660 a cached read path as well. The addressing is relatively straightforward: you index in texels. The only thing that could get in the way is the maximum 1D texture size limit.
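Something along these lines might work as a starting point. It uses the texture object API (my choice for the sketch; it needs sm_30 or higher) and assumes float data. The names `makeLinearTexture`, `addViaTexture`, and `runAddViaTexture` are invented for illustration, and the code does not check the texture size limit.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reads `a` through a texture, `b` through a plain load.
__global__ void addViaTexture(cudaTextureObject_t texA, const float* b,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The index is in texels (one float each), not in bytes.
        out[i] = tex1Dfetch<float>(texA, i) + b[i];
    }
}

// Hypothetical helper: wrap an existing linear device buffer in a texture object.
cudaError_t makeLinearTexture(cudaTextureObject_t* tex, const float* dPtr, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float*>(dPtr);
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // return raw float values

    return cudaCreateTextureObject(tex, &res, &td, nullptr);
}

// Hypothetical host helper; error handling is kept minimal.
cudaError_t runAddViaTexture(const float* hA, const float* hB, float* hOut, int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cudaTextureObject_t texA = 0;
    cudaError_t err = makeLinearTexture(&texA, dA, static_cast<size_t>(n));
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        addViaTexture<<<blocks, threads>>>(texA, dB, dOut, n);
        // This blocking copy also reports errors from the launch above.
        err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
        cudaDestroyTextureObject(texA);
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return err;
}
```

Only the read-only inputs need to go through textures; the output stays an ordinary pointer.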