Strong typing and memory copy

Hello,

I have a Fortran 90 code that I am trying with CUDA Fortran (2.3). It has a lot of array aliasing, where the original authors thought it would be nice to sometimes refer to double complex 5-D arrays as double precision arrays of reduced dimension. This was done with argument aliasing, for example:

real array(18, N, 2, 4)
call subr( array )

subroutine subr( array )
complex array(3,3,N,2,4)

This doesn’t work with Fortran modules, which seem to be the best way to organize sets of CUDA device arrays and the routines that use them. The compiler does much whining about formal parameter mismatches, etc.
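
To make it concrete, here is a stripped-down sketch of the clash (not my real code, and I have left out the device attributes since the type checking behaves the same way): once subr becomes a module procedure its interface is explicit, and the real-to-complex argument association is rejected at compile time.

module work
  implicit none
  integer, parameter :: N = 16
contains
  subroutine subr(array)
    complex :: array(3,3,N,2,4)
    array(1,1,1,1,1) = (0.0, 0.0)
  end subroutine subr
end module work

program demo
  use work
  implicit none
  real :: array(18,N,2,4)
  call subr(array)   ! the compiler whines about the formal parameter mismatch here
end program demo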

It doesn’t help that EQUIVALENCE is documented as unsupported for CUDA 2.3 … it does compile without error, but it doesn’t seem to work. Of course, I could have another error.

So, my long-winded questions are:

  1. Is there a way to relax type conformance on array copies? ( host = device, etc ).

  2. I have tried using the cudaMemcpy() routine, but it fails with errors like:
    copyin Memcpy FAILED:4
    Perhaps it is because these arrays are multi-dimensional.

  3. The runtime routines are not really well documented. Is there somewhere that provides more details on things like “cudaMemcpyToArray”? For instance, what are the dstx, dsty parameters?

  4. Is device array equivalence really supposed to work? Section 3.2.1 of the v1.2 CUDA Fortran guide says no.


    Thanks,
    Sarah

Hi Sarah,

  1. Is there a way to relax type conformance on array copies? ( host = device, etc ).

Sorry, no. Though this isn’t specifically a CUDA Fortran issue; you would have the same problem with any module.

The workaround is to write your CUDA Fortran module using complex and have “subr” call the kernel. Since “subr” stays an external routine rather than a module procedure, the caller’s hand-written interface isn’t checked against “subr” itself, so the old aliasing trick still works at the call site. Something like:

module bar

contains

  attributes(global) subroutine foo_kernel(x)
    complex, device, dimension(*) :: x
    i = threadidx%x
    x(i) = cmplx(i*2.0-1.0, i*2.0)
    return
  end subroutine foo_kernel

end module bar

subroutine foo(x)
  use bar
  complex, device, dimension(*) :: x
  call foo_kernel<<<1,50>>>(x)
end subroutine foo


program test
  use cudafor
  ! The interface declares x as real even though foo itself uses complex;
  ! the compiler never cross-checks the two, so the aliasing still works.
  interface
    subroutine foo(x)
      real, device, dimension(*) :: x
    end subroutine
  end interface
  real, allocatable, device, dimension(:) :: a
  real ha(100)
  allocate(a(100))
  ha = -1
  a = 0.0
  call foo(a)
  ha = a
  print *, ha(1), ha(2), ha(99), ha(100)
end



It doesn’t help that EQUIVALENCE is documented as unsupported for CUDA 2.3 … it does compile without error, but it doesn’t seem to work. Of course, I could have another error.

The compiler should catch this and give an error if EQUIVALENCE is used. I’ve submitted a problem report (TPR#16726) to have this fixed.

Hope this helps,
Mat

Thanks for the idea.

Any pointers on documentation for the CUDA Fortran run time memory copy routines? The manual, even the newest one, is a little thin there. I’ll try matching them up with the CUDA C equivalents to decode the “dstx” and “dsty” arguments and so on.

Perhaps a runtime call to do copies would solve my problem.

Hi SarahA,

Any pointers on documentation for the CUDA Fortran run time memory copy routines? The manual, even the newest one, is a little thin there. I’ll try matching them up with the CUDA C equivalents to decode the “dstx” and “dsty” arguments and so on.

Yes, the most frequent complaint we get from users is the lack of documentation and examples. It will get better over time.

The other complaint we get is that when we do get questions like this, we tend to send users to the NVIDIA CUDA C documentation, which is what I’ll need to do here (sorry). The Fortran “cuda” routines are just calls to the CUDA C versions, so the NVIDIA documentation should work: http://developer.download.nvidia.com/compute/cuda/3_0-Beta1/toolkit/docs/online/group__CUDART__MEMORY.html

Note that texture memory and CUDA arrays aren’t yet supported, so the routines that use them, cudaMemcpyToArray included, are only there for CUDA C compatibility.
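
For the multi-dimensional copies, one thing to double-check is the count argument: in the Fortran interfaces the count is given in elements of the array’s type, not in bytes as in CUDA C, and the arrays just need to be contiguous. Here is a rough sketch of what I mean, with my own made-up array shapes rather than yours, and not tested against your code:

program memcpy_sketch
  use cudafor
  implicit none
  integer, parameter :: N = 8
  complex :: h(3,3,N,2,4)          ! host array
  complex, device :: d(3,3,N,2,4)  ! device array
  integer :: istat

  h = (1.0, 0.0)
  ! size(h) gives the total number of elements, which is what the
  ! Fortran cudaMemcpy expects for the count
  istat = cudaMemcpy(d, h, size(h), cudaMemcpyHostToDevice)
  if (istat /= cudaSuccess) print *, 'cudaMemcpy failed: ', istat
end program memcpy_sketch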

  • Mat

Well then, at the risk of my questions getting dumber and dumber…
How does one go about specifying and calling CUDA C kernels from a CUDA Fortran program?

Since there is no separate linker step … does one just pass the source .cu file to ‘pgfortran’ and it works?

Yes, a lot of these questions will go away once a couple of concrete examples are provided.

Sarah

It would help more to support EQUIVALENCE. I have tested it further, and for the uses I have, it does seem to work. I can understand why device COMMON is not supported, but given all the other limitations, EQUIVALENCE would be very nice to have.

Well then, at the risk of my questions getting dumber and dumber…
How does one go about specifying and calling CUDA C kernels from a CUDA Fortran program?
Sarah

To answer my own question, here is an example. It was fairly easy, and yes, it was a dumb question.

nvcc -c csub.cu
pgfortran afort.CUF csub.o

afort.CUF

      program afort
      real, dimension(1000) :: this
      real, device, dimension(1000) :: dthis
      integer i

      do i = 1, 1000
         this(i) = i
      enddo

      dthis = this              ! copy host array to device
      call addone( dthis )      ! CUDA C wrapper launches the kernel
      this = dthis              ! copy result back to host
      print *, this
      end

csub.cu

#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>

__global__ void addone_kernel( float *data ) {
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  if ( idx<1000 ) data[idx] = data[idx] - 1000.0f;
}

extern "C" {

void addone_( float *data ) {
  addone_kernel<<<50,200>>>( data );
  return;
}
}

Hi Sarah,

To answer my own question, here is an example. It was fairly easy, and yes, it was a dumb question.

It’s actually one of our more common questions, so not dumb at all. Most users make it more difficult than it really is. As your example shows, it’s not much different from standard Fortran to C interoperability. For another example, I wrote an article (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm) that has a CUDA Fortran program calling a CUDA C random number generator.
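
If it helps, here is one small variant of your afort.CUF, as a sketch only since I haven’t compiled this exact version: giving the C wrapper an explicit interface makes the device attribute of the argument visible at the call site, rather than relying solely on the lowercase-plus-trailing-underscore external naming that your csub.cu already matches.

program afort
  implicit none
  interface
    subroutine addone(data)
      real, device, dimension(*) :: data
    end subroutine addone
  end interface
  real, dimension(1000) :: this
  real, device, dimension(1000) :: dthis
  integer :: i

  do i = 1, 1000
     this(i) = i
  end do

  dthis = this         ! host-to-device copy via assignment
  call addone(dthis)   ! binds to the extern "C" addone_ wrapper
  this = dthis         ! device-to-host copy via assignment
  print *, this(1), this(1000)
end program afort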

As for Equivalence, we do have a feature request in for it (TPR#16198). I’ll add a note that more users are asking for it and bump up its priority. It’s my understanding that it will be fairly difficult to implement since there isn’t an equivalent method to perform this in C.

Thanks,
Mat