CUDA Fortran and CUDA API 3D Arrays

In my quest to try out asynchronous memcpy with CUDA Fortran, I’ve figured out how to use cudaMalloc/Memcpy and cudaMallocPitch/Memcpy2D. Hooray!
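(For reference, the basic 1D pattern I have working, plus the async variant I'm building toward, looks roughly like this. It's simplified and partly sketched from the documentation rather than code I've run, so treat it as a sketch:

use cudafor
integer, parameter :: n = 1024
integer :: istat
integer(kind=cuda_stream_kind) :: stream
real, allocatable, pinned :: a(:)   ! pinned host memory so the copy can actually overlap
real, allocatable, device :: a_dev(:)

allocate(a(n), a_dev(n))
istat = cudaStreamCreate(stream)

! the async copy returns immediately; the transfer runs in the stream
istat = cudaMemcpyAsync(a_dev, a, n, cudaMemcpyHostToDevice, stream)
! ... launch kernels in the same stream, overlap host work here ...
istat = cudaStreamSynchronize(stream)

Without the pinned attribute on the host array, I believe cudaMemcpyAsync falls back to a blocking copy.)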

But now I'm moving on to the 3D API calls, and I'm wondering if anyone has any advice on these. Looking at the CUDA headers, etc., with PGI, I can see that cudaExtent looks to be a simple TYPE, but does anything special need to be done for cudaPitchedPtr or cudaMemcpy3DParms? And is there a subroutine analogous to make_cudaPitchedPtr?

(This is all to say nothing about the use of these in the kernel!)

Hi Matt,

I’m out of the office so I can’t test it myself, but I’ve seen internal mail stating that cudaMalloc3D should be added in 10.6. Once I’m back next week, I can get you further details if you need them.
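In the meantime, here is roughly the shape I’d expect the 3D calls to take, going by the CUDA C API. This is untested guesswork on my part (including whether the extent width is in bytes or elements), so don’t hold me to the details until 10.6 ships:

use cudafor
integer, parameter :: nx = 64, ny = 64, nz = 64
type(cudaExtent) :: ext
type(cudaPitchedPtr) :: devPtr, hostPtr
type(cudaMemcpy3DParms) :: p
integer :: istat

! in CUDA C the extent width for pitched allocations is in bytes,
! hence 4*nx for reals; again, guesswork for the Fortran side
ext%width  = 4*nx
ext%height = ny
ext%depth  = nz

istat = cudaMalloc3D(devPtr, ext)   ! devPtr%pitch comes back as the padded row width

! hostPtr would wrap your host buffer; whether there will be a
! make_cudaPitchedPtr helper for that is exactly your question
p%srcPtr = hostPtr
p%dstPtr = devPtr
p%extent = ext
p%kind   = cudaMemcpyHostToDevice
istat = cudaMemcpy3D(p)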

Thanks,
Mat

Hi TheMatt,
Could you please share some experience on using cudaMallocPitch() (some example) and cudaMalloc3D(), if possible.

Thanks,
Tuan

Tuan,

Sure. What I did was pretty simple, though in the end it turned out not to be worth it (my case has a pretty simple memory access pattern, so the padding doesn't help much).

Here's what I did in my test. First, in the driver:

integer :: m = 1782
integer :: np = 72

integer :: mnp_pitch, mnp1_pitch
integer :: istat

! Inputs
real, allocatable :: ta(:,:)
real, allocatable, device :: ta_dev(:,:)

! Outputs
real, allocatable :: flx(:,:)
real, allocatable, device :: flx_dev(:,:)

allocate(ta(m,np))
allocate(flx(m,np+1))

istat = cudaMallocPitch(ta_dev,mnp_pitch,m,np)
istat = cudaMallocPitch(flx_dev,mnp1_pitch,m,np+1)

(...initialize ta by reading in from file, say...)

istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,m,np)

call kernel<<<Grid,Block>>>(m,np,mnp_pitch,mnp1_pitch,ta_dev,...,flx_dev,...)

istat = cudaMemcpy2D(flx,m,flx_dev,mnp1_pitch,m,np+1)

As you can see, I use cudaMallocPitch to allocate the pitched memory; the pitch itself is returned in the second argument. I was probably being too careful by keeping separate pitches for the m-by-np and m-by-(np+1) arrays. I'm pretty sure that's overkill, but I wanted to be certain I didn't make a mistake.

The issue that got me was the argument order of cudaMemcpy2D: you have to keep the source and destination pitches straight. The first two arguments are the destination array and its pitch, followed by the source array and its pitch, and then the actual extents of the copy in elements. Say mnp_pitch comes back as 1792 (which is divisible by 32, unlike the actual m = 1782; I'm not sure that's the value it would really use). You don't want to do:

istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,mnp_pitch,np)

because even though (mnp_pitch,np) elements were allocated on the device, there are still only (m,np) elements to copy; the pitch just tells the copy how much padding to skip. (And, of course, on the host the pitch is m.)
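And since you mentioned the kernel side: the only change there is to declare the device arrays with the pitched leading dimension and index with that instead of m. Roughly (simplified from my test, so a sketch rather than the real thing):

attributes(global) subroutine kernel(m, np, mnp_pitch, mnp1_pitch, ta, flx)
  integer, value :: m, np, mnp_pitch, mnp1_pitch
  real, device :: ta(mnp_pitch,np), flx(mnp1_pitch,np+1)
  integer :: i, j

  i = (blockidx%x-1)*blockdim%x + threadidx%x
  j = (blockidx%y-1)*blockdim%y + threadidx%y

  ! the leading dimension is the pitch, but only the first m rows
  ! hold data; the rest is padding, so guard on m, not mnp_pitch
  if (i <= m .and. j <= np) then
     flx(i,j) = ta(i,j)   ! stand-in for the real computation
  end if
end subroutine kernel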

Let me know if this does or doesn't make sense. As I said, I don't currently use this in my "production" work, but I'm sure I will soon enough, so if I've made a mistake it'd be good to know!

Matt

Hi Matt,

I'm trying to use cudaMallocPitch, but pgfortran gives me this error:

“PGF90-S-0155-Could not resolve generic procedure cudamallocpitch”

This may be a dumb question, but how did you get your Fortran code to compile with the call to cudaMallocPitch? Mine looks just like yours:

r = cudaMallocPitch(udev, pitch, nx, ny*nz)

(where udev is an array of floats, and r, pitch, nx, ny, and nz are integers)
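In case it helps, the declarations look roughly like this (reconstructed from memory, so only approximate):

real, allocatable, device :: udev(:,:)
integer :: r, pitch, nx, ny, nz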

Thanks

Jim