Thinking about porting some meshing code I wrote, but memory concerns me...

Hey All,

I have a GTX 460, which is limited to about 1 GB of RAM on the card itself. The software I write needs to scale to very large data sets, spanning much more than just 1 GB.

Would it be a smart choice to port my code to CUDA? My code scales well with parallelization (the more threads the better), but I’m concerned that the latency of moving data between system RAM and the GPU might offset any parallelization benefits.

In your experience, has this been the case?

The usual strategy to mitigate transfer latency is to overlap the transfer of one block of data with the processing of the previous block. I’m not sure what algorithm you are working with, but if your blocks can be processed independently, double-buffering like this can hide most of the transfer cost. The GTX 400 cards have one DMA engine, so you should be able to at least overlap transfer in one direction with processing.
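A minimal sketch of that double-buffering pattern, using two CUDA streams and page-locked host staging buffers. The kernel name `process`, the chunk size, and the launch configuration are all placeholders for whatever your meshing code actually does:

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <algorithm>

__global__ void process(float *d, int n);  // stand-in for your per-chunk kernel

void run_chunked(const float *h_src, size_t total, size_t chunk)
{
    float *h_pinned[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        // Page-locked host memory is required for cudaMemcpyAsync to
        // actually overlap with kernel execution.
        cudaMallocHost(&h_pinned[i], chunk * sizeof(float));
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    int b = 0;  // which of the two buffers this iteration uses
    for (size_t off = 0; off < total; off += chunk, b ^= 1) {
        size_t n = std::min(chunk, total - off);
        // Wait only for the stream that last used buffer b.
        cudaStreamSynchronize(stream[b]);
        memcpy(h_pinned[b], h_src + off, n * sizeof(float));
        cudaMemcpyAsync(d_buf[b], h_pinned[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream[b]>>>(
            d_buf[b], (int)n);
        // While stream[b] copies and computes, the next loop iteration
        // fills and launches the other buffer on the other stream.
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h_pinned[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```

With only one copy engine (as on the GTX 460), the host-to-device copy in one stream can still overlap a kernel running in the other stream; it just can’t overlap another copy.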

If double-buffering is not possible, you can still see some improvement if you organize your data such that frequently accessed items are in GPU memory and infrequently used items are in page-locked CPU memory. I used that trick as a fall-back method when working with a large bounding volume hierarchy: I put as many intermediate nodes on the GPU as possible and kept the rest on the CPU. Given the relatively small size of the GPU cache, you probably don’t get many cache hits when reading CPU memory, but I haven’t studied this. I also haven’t studied (for lack of a suitable motherboard) whether CPU memory access over PCI-Express 3.0 has better latency than PCI-Express 2.0.
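The hot/cold split above can be done with mapped (zero-copy) pinned memory, where the kernel dereferences a device pointer that aliases host RAM. A rough sketch, with illustrative names (`traverse`, `hot_bytes`, `cold_bytes` are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void traverse(const float *hot, const float *cold);  // stand-in kernel

void setup(size_t hot_bytes, size_t cold_bytes)
{
    // Must be set before the CUDA context does any real work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *d_hot;                       // frequently accessed nodes: device memory
    cudaMalloc(&d_hot, hot_bytes);

    float *h_cold, *d_cold_view;        // infrequently accessed nodes: host memory
    cudaHostAlloc(&h_cold, cold_bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_cold_view, h_cold, 0);

    // Every read through d_cold_view travels over PCI-Express, so such
    // accesses should be rare and, where possible, coalesced.
    traverse<<<128, 256>>>(d_hot, d_cold_view);
    cudaDeviceSynchronize();
}
```

The design point is that the split is decided once, up front: nodes expected to be touched on nearly every traversal go in `d_hot`, and the long tail lives on the host, traded for a PCI-Express round trip on the rare accesses.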

If it looks like the GTX 460 is too limiting for your use case, the GTX 750 Ti has more compute power, twice the memory, a 4x larger L2 cache, PCI-Express 3.0 and half the power draw for $150.

If it wouldn’t be too much trouble, do you have any example code (pseudo-code will do nicely, as well) of processing one block of data while transferring another one?