I have a kernel that has to operate on an input array and an output array that are both 1.2GB each. Since my card does not have enough memory to hold both of those at the same time, I have split the input array into quarters, and I run the kernel 4 times. Between each kernel launch, I have to move data TO and FROM the device.
I saw in the CUDA best practices guide that I could potentially use asynchronous memcpy and streams to set up some overlap on my kernels, but my question is whether this will work, since obviously all of my data will not fit at once. Is there a way to set up a similar pattern so that I can get some overlap of computation and memory movement?
Would I be better off using pinned memory for this problem? Each element in the output array only gets written to once, and each element of the input array gets read a couple of times.
I'm not sure what the best way forward is; hoping to hear some suggestions.
Right now, my CUDA implementation as a whole is running slower than my OpenMP version.
You just need two streams for overlap (three on Tesla if you want to take advantage of the second DMA engine). So if you are free to partition your problem arbitrarily, just halve the chunk size once more so that the working sets of two streams fit on the device at the same time.
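Here's a minimal sketch of what I mean, assuming you can carve the arrays into equal chunks and that your host buffers are pinned (see below). The kernel body, the chunk count and the launch shape are just placeholders for your real code:

```cpp
#include <cuda_runtime.h>

#define NCHUNKS  8   // e.g. split the 1.2GB arrays into 8 pieces of ~150MB
#define NSTREAMS 2   // two streams are enough for copy/compute overlap

// Placeholder kernel -- substitute your real computation here.
__global__ void process(float *out, const float *in, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// h_in and h_out must be pinned host buffers holding `total` floats,
// with `total` divisible by NCHUNKS.
void run_chunked(const float *h_in, float *h_out, size_t total)
{
    size_t chunk = total / NCHUNKS;
    cudaStream_t stream[NSTREAMS];
    float *d_in[NSTREAMS], *d_out[NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_in[s],  chunk * sizeof(float));
        cudaMalloc(&d_out[s], chunk * sizeof(float));
    }

    for (int c = 0; c < NCHUNKS; ++c) {
        int    s   = c % NSTREAMS;   // round-robin the chunks over the streams
        size_t off = c * chunk;

        // Upload, compute, download -- all queued on the same stream, so they
        // stay ordered for this chunk, while the other stream's work can
        // overlap with them.
        cudaMemcpyAsync(d_in[s], h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_out[s], d_in[s], chunk);
        cudaMemcpyAsync(h_out + off, d_out[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

The point is just that operations queued on the same stream stay ordered, while work in different streams is free to overlap (given pinned host memory and a card that supports concurrent copy and execution).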
Whether pinned memory is useful here depends on whether your problem is bound by host<->device bandwidth, and whether you can allocate that much pinned memory without adverse effects on the rest of the system.
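For completeness, a sketch of what the pinned allocation might look like (the element count is only illustrative for a 1.2GB float array); on newer toolkits cudaHostRegister can alternatively pin a buffer you already allocated:

```cpp
// Pinned host buffers: required for cudaMemcpyAsync to actually overlap
// with kernel execution in the chunked loop above.
size_t total = (size_t)1200 * 1024 * 1024 / sizeof(float);   // ~1.2GB of floats

float *h_in, *h_out;
cudaHostAlloc((void **)&h_in,  total * sizeof(float), cudaHostAllocDefault);
cudaHostAlloc((void **)&h_out, total * sizeof(float), cudaHostAllocDefault);

// ... fill h_in, call run_chunked(h_in, h_out, total), consume h_out ...

cudaFreeHost(h_in);
cudaFreeHost(h_out);
```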
Tera, good point about the extra speed available from Tesla's dual DMA engines.
Three side questions now that I'm thinking about it (sorry to invade the thread!), especially since I haven't experimented with Tesla transfers myself…
If you’re doing a very large async memcopy, why can’t the DRIVER split that into two half-sized DMA transfers itself? Seems like a much easier automatic optimization. Or maybe the driver does this already.
Do the dual DMA units actually help net transfers of large data at all? Maybe one big 1.2GB transfer alone would saturate the PCIe bus… and the dual DMA could be useful for multiple SMALL transfers which are dominated by the overhead and setup.
Do the dual DMA units help zero-copy speeds as well?
I believe the two DMA engines are only good if they operate in opposite directions. Otherwise they’d just contend for the same PCIe bandwidth.
So splitting large transfers should not help at all.
Whether dual DMA engines can help with multiple small transfers in the same direction I have no idea.
I don't see how the DMA units could help zero-copy, but the opposite should be true: if you don't have a Tesla, you can do one transfer with zero-copy and the other by DMA. I seem to remember that a while ago somebody from Nvidia (probably Tim Murray) confirmed this.
If you can fit your 1.2GB of input data on the card, you can use zero-copy only for your output. If all the input data can't fit at once, then you can start playing splitting games (or try to use zero-copy for the input too, but that will be a bigger bottleneck).
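Something along these lines, as a rough sketch, assuming the device supports mapped memory; `process`, `d_in` and `total` are just the placeholder names from the earlier sketch:

```cpp
// Zero-copy (mapped) host memory for the output: the kernel writes each
// output element exactly once, straight across PCIe, so no device-side
// output buffer or device-to-host copy is needed.
cudaSetDeviceFlags(cudaDeviceMapHost);   // must be called before the context is created

float *h_out, *d_out_alias;
cudaHostAlloc((void **)&h_out, total * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_out_alias, h_out, 0);

// The kernel writes through d_out_alias; the values land directly in h_out.
process<<<(total + 255) / 256, 256>>>(d_out_alias, d_in, total);
cudaDeviceSynchronize();                 // make sure all zero-copy writes arrived
```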
A nice side benefit when using zero-copy for your output is the pretty easy extension to multi-GPU. Each GPU gets the same data onboard, but then you just have different GPUs save the output data via zero-copy and you end up with the final answer already assembled on your host.
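A rough sketch of that multi-GPU version, assuming the CUDA 4.0+ runtime API where one host thread can drive all devices; `process_slice` and `run_multi_gpu` are hypothetical names, and the output buffer is allocated as portable so every device can map it:

```cpp
// Hypothetical kernel: each GPU fills only its [begin, end) slice of out.
__global__ void process_slice(float *out, const float *in,
                              size_t begin, size_t end)
{
    size_t i = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end) out[i] = 2.0f * in[i];   // placeholder computation
}

// Every GPU gets the full input; each writes its own slice of the shared,
// mapped output, so the final answer is already assembled on the host.
void run_multi_gpu(const float *h_in, float **h_out_ret, size_t total)
{
    int ngpus;
    cudaGetDeviceCount(&ngpus);

    // Mapped-memory support must be enabled before each device's context exists.
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaSetDeviceFlags(cudaDeviceMapHost);
    }

    // One shared output buffer: mapped (zero-copy) and portable, so every
    // device can write to it.
    float *h_out;
    cudaHostAlloc((void **)&h_out, total * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocPortable);

    size_t slice = total / ngpus;          // assumes total divides evenly
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);

        float *d_in, *d_out_alias;
        cudaMalloc(&d_in, total * sizeof(float));
        cudaMemcpy(d_in, h_in, total * sizeof(float), cudaMemcpyHostToDevice);
        cudaHostGetDevicePointer((void **)&d_out_alias, h_out, 0);

        size_t begin = g * slice, end = (g + 1) * slice;
        process_slice<<<(slice + 255) / 256, 256>>>(d_out_alias, d_in, begin, end);
    }

    // Wait on every device; h_out then holds the fully assembled result.
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
    *h_out_ret = h_out;                    // (device-side cleanup omitted)
}
```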