CUDA Fortran pointer and CUF kernel


I am using CUDA Fortran on a multi-GPU system. In order to place arrays on different GPUs, I define a type like this:

    module mul_dev
      ! distributed arrays
      type deviceArray
        real*8, device, allocatable :: den_dev(:,:,:), rhou_dev(:,:,:)
      end type deviceArray
      type (deviceArray), device, pointer :: dev_ptr
    end module mul_dev

and in the main program I define

    type (deviceArray), target, allocatable :: dev(:)  ! (1:nDevices)

However, when I try to use these arrays in a CUF kernel,

      do isub = 1, ndomainM

        istart = refIDM(isub,1)
        jstart = refIDM(isub,2)
        kstart = refIDM(isub,3)
        iend   = refIDM(isub,4)+istart-1
        jend   = refIDM(isub,5)+jstart-1
        kend   = refIDM(isub,6)+kstart-1
        dev_ptr => dev(isub)

!$cuf kernel do(3) <<<*,*,stream=streamID(isub)>>>
        do k = kstart, kend-1
          do j = jstart, jend-1
            do i = istart, iend
              ! ... = ( ... + dev_ptr%den_dev(i+1,j+1,k+1))/2d0   ! statement truncated in the post
            end do
          end do
        end do

        write(*,*) "after multi domain in device "
        write(*,*) isub, cudaGetErrorString(ierr)
      end do


I get the following error:

misaligned address

Do you have any suggestions?

Hi cofludy,

First, “dev” is a host array, so pointing “dev_ptr” at “dev(iSub)” points to a host array that is not available on the device. Instead, you can create two device pointers that point to the data members and then use these pointers in the CUF kernel. Be sure, when you’re allocating the data members, that you have set the device so the data is allocated on the correct device.
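For instance, something along these lines (a sketch only; `nx`, `ny`, `nz`, `istat`, and the averaging statement in the kernel are illustrative assumptions, since the original kernel body was cut off):

```fortran
! Sketch of the two-device-pointer approach (assumes "use cudafor" is in scope
! and that "dev" has the target attribute, as in the original code)
real*8, device, pointer :: den_p(:,:,:), rhou_p(:,:,:)
integer :: istat

! Allocate each sub-domain's members on its own device first
do isub = 1, ndomainM
  istat = cudaSetDevice(isub-1)
  allocate(dev(isub)%den_dev(nx,ny,nz), dev(isub)%rhou_dev(nx,ny,nz))
end do

! In the compute loop, point plain device pointers at the members,
! then reference only the pointers inside the CUF kernel
do isub = 1, ndomainM
  istat = cudaSetDevice(isub-1)
  den_p  => dev(isub)%den_dev
  rhou_p => dev(isub)%rhou_dev
!$cuf kernel do(3) <<<*,*,stream=streamID(isub)>>>
  do k = kstart, kend-1
    do j = jstart, jend-1
      do i = istart, iend
        rhou_p(i,j,k) = (den_p(i,j,k) + den_p(i+1,j+1,k+1))/2d0
      end do
    end do
  end do
end do
```

The key point is that the pointers themselves live on the host but carry the `device` attribute, so the CUF kernel can dereference them directly without touching the host-resident `dev` array.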

Alternatively, you might try changing the data members to use “managed” instead of “device”, and then also add “managed” to the “dev” array. This way, the CUDA driver will automatically move the data to the correct device when accessed.
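With managed memory, the module and host array from the question would look roughly like this (a sketch; allocation is unchanged otherwise):

```fortran
module mul_dev
  ! distributed arrays, now in managed (unified) memory
  type deviceArray
    real*8, managed, allocatable :: den_dev(:,:,:), rhou_dev(:,:,:)
  end type deviceArray
end module mul_dev

! ... and in the main program:
! type (deviceArray), managed, allocatable :: dev(:)  ! (1:nDevices)
```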

A third option would be to use CUDA peer-to-peer communication so that one device can read data from another device. Though I don’t have much experience with this, and you’d need a newer Tesla device (P100 or V100) with NVLink between the devices in order for it to be performant.

I typically recommend using MPI+CUDA Fortran when doing multi-gpu programming. The logic is much simpler than trying to create arrays of data, one per device, as you do here.
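The rank-per-GPU pattern looks roughly like this (a sketch; the array size and the halo exchange are placeholders):

```fortran
! Sketch: one MPI rank per GPU; each rank owns ordinary device arrays,
! so no per-device array-of-types or pointer juggling is needed.
program mpi_cuda_sketch
  use cudafor
  use mpi
  implicit none
  integer :: rank, nprocs, ierr, istat
  real*8, device, allocatable :: den_dev(:,:,:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Bind this rank to one device; allocations after this land on it
  istat = cudaSetDevice(rank)
  allocate(den_dev(64,64,64))
  den_dev = 0d0

  ! ... each rank runs its CUF kernels on its own device;
  ! halo exchange between sub-domains is done with MPI sends/receives ...

  call MPI_Finalize(ierr)
end program mpi_cuda_sketch
```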


Hi mkcog,

Thank you very much for your answer. Your suggestion works.