Where is the padding done? On the host or on the device?
The problem is that I have an array in which the first three and last three elements of each row are ghost elements, so no threads need to operate on them. I've tried several approaches to improve the alignment, including ghost threads.
My question is: where does cudaMemcpy2D reformat and pad the array, on the host or on the device? In particular, will I take a large performance hit if I reformat and pad the array by hand on the host? If cudaMemcpy2D() does the work on the host, then I should be able to do it myself without a penalty. If the padding is done efficiently in transit, I presumably could not match that speed on the host.
I think it does it on the host. It just knows the right padding for the card, which may vary from one piece of hardware to another. I'm not sure exactly what it does, but I think it just staggers the rows so that column-wise accesses don't all pound the same memory channel. See here: http://forums.nvidia.com/index.php?showtop…ndpost&p=359400
If you have intimate knowledge of which elements you need and don't need, then just do the padding yourself. Btw, I'm assuming you're doing all this to get coalescing?