With few exceptions, 64-bit integer operations on the GPU are emulated through multiple native instructions. In CUDA, the address size used by device code matches the address size of the host. Moving CUDA code to a 64-bit platform makes all addresses (and therefore pointers) 64-bit quantities, and in this way introduces 64-bit arithmetic for address computations.
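To make the emulation concrete, here is a rough host-side C++ sketch of how a single 64-bit add breaks down into 32-bit operations plus a carry. The names (U64Pair, add64_emulated, to_u64, from_u64) are made up for illustration. This is not what the compiler literally emits, but it is the same idea as the carry-chained add pair (add.cc / addc) you would typically find in the generated PTX.

```cpp
#include <cstdint>

// Illustrative only: two 32-bit halves standing in for a pair of 32-bit registers.
struct U64Pair {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Split a 64-bit value into its low and high 32-bit words.
inline U64Pair from_u64(std::uint64_t v) {
    return U64Pair{static_cast<std::uint32_t>(v),
                   static_cast<std::uint32_t>(v >> 32)};
}

// Reassemble the two 32-bit words into a 64-bit value.
inline std::uint64_t to_u64(U64Pair p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add built only from 32-bit operations:
// add the low words, detect the carry, then add the high words plus the carry.
inline U64Pair add64_emulated(U64Pair a, U64Pair b) {
    std::uint32_t lo = a.lo + b.lo;               // wraps modulo 2^32
    std::uint32_t carry = (lo < a.lo) ? 1u : 0u;  // wrap-around means a carry out
    std::uint32_t hi = a.hi + b.hi + carry;
    return U64Pair{lo, hi};
}
```

Every pointer increment or index-to-address computation that really needs all 64 bits pays for the extra operation and the extra register, which is where the higher instruction count and register pressure come from.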
The compiler tries to avoid full 64-bit computation where possible, but in general you will observe an increase in register usage (as each register can hold only 32 bits of data) and an increase in dynamic instruction count. The increase in register usage can lead to decreased occupancy, which in turn can lead to lower performance. In your case that does not seem to happen (you may want to double-check with the profiler), but the increase in the number of instructions to be executed leads to a slowdown.
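As a back-of-the-envelope illustration of the register/occupancy link, the following C++ sketch estimates how many warps can be resident on one multiprocessor when registers are the only limit. The function name and the numbers in the usage note are assumptions for illustration. It ignores register allocation granularity, shared memory, and block-size limits, so treat it as a rough guide and trust the profiler or the occupancy calculator for real figures.

```cpp
#include <algorithm>

// Rough estimate of resident warps per multiprocessor when registers are the
// only limiting resource. All parameters are supplied by the caller; this
// sketch deliberately ignores allocation granularity and other limits.
inline int resident_warps_by_registers(int regs_per_thread,
                                       int regs_per_sm,
                                       int max_warps_per_sm,
                                       int warp_size = 32) {
    int regs_per_warp = regs_per_thread * warp_size;   // registers one warp needs
    int warps_that_fit = regs_per_sm / regs_per_warp;  // integer division rounds down
    return std::min(warps_that_fit, max_warps_per_sm);
}
```

For example, assuming 65536 registers and a limit of 64 warps per multiprocessor, a kernel using 32 registers per thread can keep all 64 warps resident, while the same kernel at 40 registers per thread fits only 51. Whether a drop like that actually costs performance depends on the kernel.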
Not knowing anything about the particulars of the code, I can only say that in my experience a 20% drop is within the range seen with transitions from 32-bit to 64-bit platforms. You may want to dig in with the profiler to see where slight modifications to the code could restore some of the lost performance. A 2D stencil operation would normally suggest that the code is limited by memory bandwidth rather than instruction throughput, which in turn would suggest a smaller performance drop from moving to a 64-bit platform. On a modern GPU like the GTX 660, the profiler has ample hardware counters at its disposal to pinpoint what is impacting performance, so I would suggest spending some time with it.