I have a kernel that has to operate on an input array and an output array that are both 1.2 GB each. Since my card does not have enough memory to hold both at the same time, I have split the input array into quarters, and I run the kernel four times. Between kernel launches, I have to move data both TO and FROM the device.
I saw in the CUDA Best Practices Guide that I could potentially use asynchronous memcpy and streams to overlap transfers with kernel execution, but my question is whether this will work in my case, since obviously all of my data will not fit on the device at once. Is there a way to set up a similar pattern so that I can get some overlap of computation and memory transfers?
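To make the question concrete, here is a rough sketch of the pattern I have in mind: two streams ping-ponging between two pairs of device buffers, so the upload for one chunk can overlap the kernel for the previous one. Names like `myKernel`, `N`, and the launch configuration are placeholders, not my actual code.

```cuda
#include <cuda_runtime.h>

#define NUM_CHUNKS 4

__global__ void myKernel(const float *in, float *out, size_t n);  // placeholder

void processChunked(const float *h_in, float *h_out, size_t N) {
    size_t chunkElems = N / NUM_CHUNKS;
    size_t chunkBytes = chunkElems * sizeof(float);

    // Two streams and two buffer pairs: while stream 0 runs the kernel
    // on chunk i, stream 1 can already be uploading chunk i+1.
    cudaStream_t streams[2];
    float *d_in[2], *d_out[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&d_in[s],  chunkBytes);
        cudaMalloc((void **)&d_out[s], chunkBytes);
    }

    dim3 block(256);
    dim3 grid((unsigned)((chunkElems + block.x - 1) / block.x));

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        int s = c % 2;  // ping-pong between the two buffers/streams
        cudaMemcpyAsync(d_in[s], h_in + c * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        myKernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s], chunkElems);
        cudaMemcpyAsync(h_out + c * chunkElems, d_out[s], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

Is this roughly the right shape, given that only a fraction of the data can live on the device at a time?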
Would I be better off using pinned memory for this problem? Each element of the output array is written only once, and each element of the input array is read a few times.
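My understanding is that `cudaMemcpyAsync` only truly overlaps with computation when the host buffers are page-locked (pinned). Assuming my arrays are currently allocated with plain `malloc` (an assumption for this sketch), switching them to pinned allocations would look something like:

```cuda
#include <cuda_runtime.h>

int main(void) {
    float *h_in, *h_out;
    size_t bytes = (size_t)1200 * 1024 * 1024;  // ~1.2 GB per array

    // Page-locked host allocations: required for cudaMemcpyAsync to be
    // genuinely asynchronous with respect to kernel execution.
    cudaHostAlloc((void **)&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault);

    // ... fill h_in, run the chunked pipeline, consume h_out ...

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

My concern is that pinning 2.4 GB of host memory in one go might cause problems of its own, which is part of why I'm unsure.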
I'm not sure what the best way forward is, and I'm hoping to hear some suggestions.
Right now, my CUDA implementation as a whole runs slower than my OpenMP version.