Tuan,
I suppose so. What I did was pretty simple, though, in the end, it turned out it wasn’t worth it (my case has pretty simple memory access so padding doesn’t help much).
What I did in my test was like this. First in the driver:
integer :: m = 1782
integer :: np = 72
integer :: mnp_pitch, mnp1_pitch
integer :: istat
! Inputs
real, allocatable :: ta(:,:)
real, allocatable, device :: ta_dev(:,:)
!Outputs
real, allocatable :: flx(:,:)
real, allocatable, device :: flx_dev(:,:)
allocate(ta(m,np))
allocate(flx(m,np+1))
istat = cudaMallocPitch(ta_dev,mnp_pitch,m,np)
istat = cudaMallocPitch(flx_dev,mnp1_pitch,m,np+1)
(...initialize ta by reading in from file, say...)
istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,m,np)
call kernel<<<Grid,Block>>>(m,np,mnp_pitch,mnp1_pitch,ta_dev,...,flx_dev,...)
istat = cudaMemcpy2D(flx,m,flx_dev,mnp1_pitch,m,np+1)
What we see is that I use cudaMallocPitch to allocate the pitched memory and get the pitch itself which returns in the second element. As you can see here I was being a bit too careful by having a pitch for both m-by-np and m-by-np+1 arrays. This is overkill I’m pretty sure, but I wanted to make sure I didn’t make a mistake.
The issue that got me was the order of the cudaMemcpy2D. You have to remember to get the src and dst pitches correct. That is, the first two elements are the destination array and pitch followed by the source array and pitch, and then the actual number of elements. Say the mnp_pitch is 1792 (which 32 divides unlike the actual m = 1782, not sure if that’s what it will actually use), you don’t want to use:
istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,mnp_pitch,np)
because you might have allocated (mnp_pitch,np) on the device, but there are still only (m,np) elements, the pitch is just telling it what to skip. (And, of course, on the host the pitch is m.)
Let me know if this does or does not make sense. As I said, I don’t use this currently in my “production” work, but I’m sure soon enough I will, so if I made a mistake, it’d be good to know!
Matt