Where is the padding done? On the host or on the device?
The problem is I have an array for which the first three and last three elements of each row are ghost elements and do not need threads to partake in the calculation. I’ve tried several approaches to improve the alignment, including ghost threads.
My question is “Where is the array reformatted and padded in cudaMemcpy2D()?” On the host or on the device? In particular, will I take a large hit if I reformat and pad the array by hand on the host? If cudaMemcpy2D() does the work on the host, then I should be able to do it on the host without a penalty. If the padding is done efficiently in transit, I presumably could not do it as quickly on the host.
If you have intimate knowledge of which elements you need or don’t need, etc., then just use that knowledge and do the padding yourself. By the way, I’m assuming you’re doing all this to get coalescing?
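For what it's worth, here is a minimal sketch of what padding by hand on the host could look like. It is only an illustration under assumptions of mine, not a statement about what cudaMemcpy2D() does internally: the names `PaddedLayout`, `make_layout` and `pad_rows` are made up, the elements are assumed to be `float`, and the alignment (in elements) is a parameter you would pick for your hardware. The idea is to put a few unused elements in front of each row so that the first non-ghost element lands on an aligned index, and to round the row length up to a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper type: describes the padded row layout, in elements.
struct PaddedLayout {
    std::size_t lead;   // unused elements placed before each row
    std::size_t pitch;  // padded row length
};

// Choose lead so that index (lead + ghost) is a multiple of align, and round
// the padded row length up to a multiple of align.
// width = full row length including the ghost elements on both sides.
PaddedLayout make_layout(std::size_t width, std::size_t ghost, std::size_t align) {
    assert(align > 0);
    std::size_t lead = (align - ghost % align) % align;
    std::size_t pitch = (lead + width + align - 1) / align * align;
    return {lead, pitch};
}

// Copy a dense height x width array into a zero-filled padded buffer.
std::vector<float> pad_rows(const std::vector<float>& src, std::size_t width,
                            std::size_t height, const PaddedLayout& layout) {
    assert(src.size() == width * height);
    std::vector<float> dst(layout.pitch * height, 0.0f);
    for (std::size_t row = 0; row < height; ++row) {
        // One memcpy per row: ghost and interior elements stay contiguous.
        std::memcpy(dst.data() + row * layout.pitch + layout.lead,
                    src.data() + row * width,
                    width * sizeof(float));
    }
    return dst;
}
```

With 3 ghost elements per side and an assumed alignment of 16 floats, `make_layout` gives a lead of 13, so the first interior element of every row sits at index 16 within its row. The padded buffer is contiguous, so it could then go to the device with a single ordinary cudaMemcpy(). Whether that beats cudaMemcpy2D() is something you would have to time on your own setup.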