The usual strategy to mitigate latency is to overlap the transfer of one block of data with the processing of the previous block. I’m not sure what algorithm you are working with, but if you can process one block of data while transferring another, that can be a very effective technique. The GTX 400 cards have a single DMA (copy) engine, so you should at least be able to overlap transfer in one direction with kernel execution.
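A minimal sketch of that double-buffering pattern, using two streams and alternating device buffers (the kernel name `process_block` and the sizes are placeholders, not anything from your code; the host buffer must be page-locked for `cudaMemcpyAsync` to actually be asynchronous):

```cuda
#include <cuda_runtime.h>

#define BLOCK_ELEMS (1 << 20)   // elements per chunk (illustrative)
#define NUM_BLOCKS  8

// Stand-in for whatever per-block work your algorithm does.
__global__ void process_block(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= 2.0f;
}

int main(void)
{
    float *h_data;              // page-locked host buffer
    float *d_buf[2];
    cudaStream_t stream[2];

    cudaMallocHost(&h_data, (size_t)NUM_BLOCKS * BLOCK_ELEMS * sizeof(float));
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], BLOCK_ELEMS * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        int s = b & 1;          // ping-pong between the two buffers/streams
        float *h = h_data + (size_t)b * BLOCK_ELEMS;

        cudaMemcpyAsync(d_buf[s], h, BLOCK_ELEMS * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process_block<<<(BLOCK_ELEMS + 255) / 256, 256, 0, stream[s]>>>(
            d_buf[s], BLOCK_ELEMS);
        cudaMemcpyAsync(h, d_buf[s], BLOCK_ELEMS * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

With only one copy engine, the upload for block b+1 can overlap the kernel for block b, but the two copy directions still serialize against each other; cards with dual copy engines (e.g. the Tesla line) can overlap both directions at once.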
If double-buffering is not possible, you can still see some improvement if you can organize your data such that frequently accessed items are in GPU memory and infrequently used items are in page-locked CPU memory. I used that trick as a fall-back method when working with a large bounding volume hierarchy: I put as many intermediate nodes on the GPU as possible, and kept the rest on the CPU. Given the relatively small size of the GPU cache, you probably don’t get many cache hits when reading the CPU memory, but I haven’t studied this. I also haven’t studied (for lack of a suitable motherboard) whether CPU memory access over PCI-Express 3.0 has better latency than PCI-E 2.0.
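For reference, the mapped (zero-copy) pinned memory setup that makes this split possible looks roughly like this; the sizes and the hot/cold naming are illustrative, not from my BVH code:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    size_t hot_bytes  = (size_t)64  << 20;  // fits in device memory
    size_t cold_bytes = (size_t)256 << 20;  // spills to pinned host memory

    // Frequently accessed nodes live in ordinary device memory.
    float *d_hot;
    cudaMalloc(&d_hot, hot_bytes);

    // Rarely accessed nodes live in page-locked host memory that the
    // GPU can read directly over PCI-Express.
    float *h_cold, *d_cold;
    cudaHostAlloc(&h_cold, cold_bytes, cudaHostAllocMapped);
    // Device-side alias of the pinned host allocation; every access from
    // a kernel goes across the bus, so keep these reads rare.
    cudaHostGetDevicePointer((void **)&d_cold, h_cold, 0);

    // Kernels would take both d_hot and d_cold and pick the pointer
    // based on which region a given node was assigned to.

    cudaFreeHost(h_cold);
    cudaFree(d_hot);
    return 0;
}
```

On older toolkits you also had to call `cudaSetDeviceFlags(cudaDeviceMapHost)` before any allocation for the mapping to work, so check `cudaDeviceProp::canMapHostMemory` first.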
If it looks like the GTX 460 is too limiting for your use case, the GTX 750 Ti offers more compute power, twice the memory, a 4x larger L2 cache, and PCI-Express 3.0 support at roughly half the power draw, for about $150.