Strong typing and memory copy

Hello,

I have a Fortran 90 code that I am trying with CUDA Fortran (2.3). It has a lot of array aliasing, where the original authors thought it would be nice to sometimes refer to double complex 5-D arrays as double precision arrays of reduced dimension. This was done with argument aliasing, for example:

real array(18, N, 2, 4)
call subr( array )

subroutine subr( array )
complex array(3,3,N,2,4)

This doesn’t work with Fortran modules, which seem to be the best way to organize sets of CUDA device arrays and the routines that use them. The compiler does much whining about formal parameter mismatches, etc.
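
To make it concrete, here is a stripped-down sketch of the clash (not my real code, and I have left out the device attributes since the type checking behaves the same way): once subr becomes a module procedure its interface is explicit, and the real-to-complex argument association is rejected at compile time.

module work
  implicit none
  integer, parameter :: N = 16
contains
  subroutine subr(array)
    complex :: array(3,3,N,2,4)
    array(1,1,1,1,1) = (0.0, 0.0)
  end subroutine subr
end module work

program demo
  use work
  implicit none
  real :: array(18,N,2,4)
  call subr(array)   ! the compiler whines about the formal parameter mismatch here
end program demo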

It doesn’t help that EQUIVALENCE is documented as unsupported for CUDA 2.3 … it does compile without error, but it doesn’t seem to work. Of course, I could have another error.

So, my long-winded questions are:

  1. Is there a way to relax type conformance on array copies? ( host = device, etc ).

  2. I have tried using the cudaMemcpy() routine, but it fails with errors like:
    copyin Memcpy FAILED:4
    Perhaps it is because these arrays are multi-dimensional.

  3. The runtime routines are not really well documented. Is there somewhere that provides more details on things like “cudaMemcpyToArray”? For instance, what are the dstx, dsty parameters?

  4. Is device array equivalence really supposed to work? Section 3.2.1 of the v1.2 CUDA Fortran guide says no.


    Thanks,
    Sarah

Hi Sarah,

  1. Is there a way to relax type conformance on array copies? ( host = device, etc ).

Sorry, no. Though this isn’t specifically a CUDA Fortran issue; you would have the same problem with any module.

The workaround is to write your CUDA Fortran module using complex and have “subr” call the kernel. Since “subr” stays an external routine rather than a module procedure, the caller’s hand-written interface isn’t checked against “subr” itself, so the old aliasing trick still works at the call site. Something like:

module bar

contains

  attributes(global) subroutine foo_kernel(x)
    complex, device, dimension(*) :: x
    i = threadidx%x
    x(i) = cmplx(i*2.0-1.0, i*2.0)
    return
  end subroutine foo_kernel

end module bar

subroutine foo(x)
  use bar
  complex, device, dimension(*) :: x
  call foo_kernel<<<1,50>>>(x)
end subroutine foo


program test
  use cudafor
  ! The interface declares x as real even though foo itself uses complex;
  ! the compiler never cross-checks the two, so the aliasing still works.
  interface
    subroutine foo(x)
      real, device, dimension(*) :: x
    end subroutine
  end interface
  real, allocatable, device, dimension(:) :: a
  real ha(100)
  allocate(a(100))
  ha = -1
  a = 0.0
  call foo(a)
  ha = a
  print *, ha(1), ha(2), ha(99), ha(100)
end



It doesn’t help that EQUIVALENCE is documented as unsupported for CUDA 2.3 … it does compile without error, but it doesn’t seem to work. Of course, I could have another error.

The compiler should catch this and give an error if EQUIVALENCE is used. I’ve submitted a problem report (TPR#16726) to have this fixed.

Hope this helps,
Mat

Thanks for the idea.

Any pointers on documentation for the CUDA Fortran run time memory copy routines? The manual, even the newest one, is a little thin there. I’ll try matching them up with the CUDA C equivalents to decode the “dstx” and “dsty” arguments and so on.

Perhaps a runtime call to do copies would solve my problem.

Hi SarahA,

Any pointers on documentation for the CUDA Fortran run time memory copy routines? The manual, even the newest one, is a little thin there. I’ll try matching them up with the CUDA C equivalents to decode the “dstx” and “dsty” arguments and so on.

Yes, the most frequent complaint we get from users is the lack of documentation and examples. It will get better over time.

The other complaint we get is that when we do get questions like this, we tend to send users to the NVIDIA CUDA C documentation, which is what I’ll need to do here (sorry). The Fortran “cuda” routines are just calls to the CUDA C versions, so the NVIDIA documentation should work: http://developer.download.nvidia.com/compute/cuda/3_0-Beta1/toolkit/docs/online/group__CUDART__MEMORY.html

Note that texture memory and CUDA arrays aren’t yet supported, so the routines that use them, cudaMemcpyToArray included, are only there for CUDA C compatibility.
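
For the multi-dimensional copies, one thing to double-check is the count argument: in the Fortran interfaces the count is given in elements of the array’s type, not in bytes as in CUDA C, and the arrays just need to be contiguous. Here is a rough sketch of what I mean, with my own made-up array shapes rather than yours, and not tested against your code:

program memcpy_sketch
  use cudafor
  implicit none
  integer, parameter :: N = 8
  complex :: h(3,3,N,2,4)          ! host array
  complex, device :: d(3,3,N,2,4)  ! device array
  integer :: istat

  h = (1.0, 0.0)
  ! size(h) gives the total number of elements, which is what the
  ! Fortran cudaMemcpy expects for the count
  istat = cudaMemcpy(d, h, size(h), cudaMemcpyHostToDevice)
  if (istat /= cudaSuccess) print *, 'cudaMemcpy failed: ', istat
end program memcpy_sketch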

  • Mat

Well then, at the risk of my questions getting dumber and dumber…
How does one go about specifying and calling CUDA C kernels from a CUDA Fortran program?

Since there is no separate linker step … does one just pass the source .cu file to ‘pgfortran’ and it works?

Yes, a lot of these questions will go away once a couple of concrete examples are provided.

Sarah

It would help more to support EQUIVALENCE. I have tested it further, and for the uses I have, it does seem to work. I can understand why device COMMON is not supported, but given all the other limitations, EQUIVALENCE would be very nice to have.

Well then, at the risk of my questions getting dumber and dumber…
How does one go about specifying and calling CUDA C kernels from a CUDA Fortran program?
Sarah

To answer my own question, here is an example. It was fairly easy, and yes, it was a dumb question.

nvcc -c csub.cu
pgfortran afort.CUF csub.o

afort.CUF

      program afort
      real, dimension(1000) :: this
      real, device, dimension(1000) :: dthis
      integer i

      do i = 1, 1000
         this(i) = i
      enddo

      dthis = this              ! copy host array to device
      call addone( dthis )      ! CUDA C wrapper launches the kernel
      this = dthis              ! copy result back to host
      print *, this
      end

csub.cu

#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>

__global__ void addone_kernel( float *data ) {
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  if ( idx<1000 ) data[idx] = data[idx] - 1000.0f;
}

extern "C" {

void addone_( float *data ) {
  addone_kernel<<<50,200>>>( data );
  return;
}
}

Hi Sarah,

To answer my own question, here is an example. It was fairly easy, and yes, it was a dumb question.

It’s actually one of our more common questions, so not dumb at all. Most users make it more difficult than it really is. As your example shows, it’s not much different from standard Fortran to C interoperability. For another example, I wrote an article (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm) that has a CUDA Fortran program calling a CUDA C random number generator.
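
If it helps, here is one small variant of your afort.CUF, as a sketch only since I haven’t compiled this exact version: giving the C wrapper an explicit interface makes the device attribute of the argument visible at the call site, rather than relying solely on the lowercase-plus-trailing-underscore external naming that your csub.cu already matches.

program afort
  implicit none
  interface
    subroutine addone(data)
      real, device, dimension(*) :: data
    end subroutine addone
  end interface
  real, dimension(1000) :: this
  real, device, dimension(1000) :: dthis
  integer :: i

  do i = 1, 1000
     this(i) = i
  end do

  dthis = this         ! host-to-device copy via assignment
  call addone(dthis)   ! binds to the extern "C" addone_ wrapper
  this = dthis         ! device-to-host copy via assignment
  print *, this(1), this(1000)
end program afort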

As for Equivalence, we do have a feature request in for it (TPR#16198). I’ll add a note that more users are asking for it and bump up its priority. It’s my understanding that it will be fairly difficult to implement since there isn’t an equivalent method to perform this in C.

Thanks,
Mat