CUDA Fortran pointer and CUF kernel

Hi,

I am using CUDA Fortran with multiple GPU devices. In order to place arrays on different GPUs, I define a type like this:


	module mul_dev

	   ! distributed arrays
	   type deviceArray
	      real*8, device, allocatable :: den_dev(:,:,:), rhou_dev(:,:,:)
	   end type deviceArray

	   type (deviceArray), device, pointer :: dev_ptr

	end module mul_dev

and in the main program I declare

	type (deviceArray), target, allocatable :: dev(:)   ! (1:nDevices)

However, when I try to use these arrays in a CUF kernel,

	do isub = 1, ndomainM

	   ierr = cudaSetDevice(deviceIDM(isub))

	   istart = refIDM(isub,1)
	   jstart = refIDM(isub,2)
	   kstart = refIDM(isub,3)
	   iend   = refIDM(isub,4)+istart-1
	   jend   = refIDM(isub,5)+jstart-1
	   kend   = refIDM(isub,6)+kstart-1

	   dev_ptr => dev(isub)

!$cuf kernel do(3) <<<*,*,stream=streamID(isub)>>>
	   do k = kstart, kend-1
	   do j = jstart, jend-1
	   do i = istart, iend
	      dev_ptr%rhou_dev(i,j,k) = (dev_ptr%den_dev(i,j+1,k+1) + &
	                                 dev_ptr%den_dev(i+1,j+1,k+1))/2d0
	   enddo
	   enddo
	   enddo

	   write(*,*) "after multi domain in device "
	   ierr = cudaThreadSynchronize()
	   ierr = cudaGetLastError()
	   write(*,*) isub, cudaGetErrorString(ierr)

	enddo

I get the following error:


misaligned address

Do you have any suggestions?

Hi cofludy,

First, “dev” is a host array, so pointing “dev_ptr” at “dev(isub)” is pointing to a host array that is not accessible on the device. Instead, you can use two device pointers that point to the data members, and then use those pointers in the CUF kernel. Be sure that when you allocate the data members you have set the device first, so the data is allocated on the correct device.
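
For example, a rough sketch of this approach (untested, and assuming extents nx, ny, nz for the arrays, plus your existing deviceIDM, refIDM, and streamID arrays) could look like:

	use cudafor
	use mul_dev     ! deviceArray as in your module; dev_ptr is no longer needed
	type (deviceArray), target, allocatable :: dev(:)   ! (1:nDevices)
	real*8, device, pointer :: den_d(:,:,:), rhou_d(:,:,:)
	integer :: ierr, isub, i, j, k
	integer :: istart, jstart, kstart, iend, jend, kend

	allocate(dev(nDevices))

	! allocate each element's members while its device is current
	do isub = 1, nDevices
	   ierr = cudaSetDevice(deviceIDM(isub))
	   allocate(dev(isub)%den_dev(nx,ny,nz), dev(isub)%rhou_dev(nx,ny,nz))
	enddo

	do isub = 1, ndomainM
	   ierr = cudaSetDevice(deviceIDM(isub))

	   istart = refIDM(isub,1)
	   jstart = refIDM(isub,2)
	   kstart = refIDM(isub,3)
	   iend   = refIDM(isub,4)+istart-1
	   jend   = refIDM(isub,5)+jstart-1
	   kend   = refIDM(isub,6)+kstart-1

	   ! point at the device data members; dev(:) itself stays on the host
	   den_d  => dev(isub)%den_dev
	   rhou_d => dev(isub)%rhou_dev

!$cuf kernel do(3) <<<*,*,stream=streamID(isub)>>>
	   do k = kstart, kend-1
	   do j = jstart, jend-1
	   do i = istart, iend
	      rhou_d(i,j,k) = (den_d(i,j+1,k+1) + den_d(i+1,j+1,k+1))/2d0
	   enddo
	   enddo
	   enddo
	enddo

The pointer assignment itself is done on the host; only the data the pointers reference lives on the device, which is all the CUF kernel needs.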

Alternatively, you might try changing the data members to use “managed” instead of “device”, and also adding “managed” to the “dev” array. That way, the CUDA driver will automatically migrate the data to whichever device accesses it.
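
In that case the module might look something like this (again just a sketch):

	module mul_dev

	   ! distributed arrays, now resident in unified (managed) memory
	   type deviceArray
	      real*8, managed, allocatable :: den_dev(:,:,:), rhou_dev(:,:,:)
	   end type deviceArray

	   ! the container is managed as well, so it can be referenced in device code
	   type (deviceArray), managed, allocatable :: dev(:)   ! (1:nDevices)

	end module mul_dev

Your CUF kernel could then index dev(isub)%den_dev and dev(isub)%rhou_dev directly, with no pointer at all, though performance will depend on how often pages have to migrate between devices.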

A third option would be to use CUDA peer-to-peer access, so that one device can read data that resides on another device. Though I don’t have much experience with this, and you’d need newer Tesla devices (P100 or V100) with NVLink between them for it to be performant.
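
If you do want to experiment with it, the setup is roughly the following (untested on my side; devices are numbered 0 to nDevices-1 here):

	use cudafor
	integer :: ierr, canAccess, idev, jdev

	do idev = 0, nDevices-1
	   ierr = cudaSetDevice(idev)
	   do jdev = 0, nDevices-1
	      if (jdev == idev) cycle
	      ierr = cudaDeviceCanAccessPeer(canAccess, idev, jdev)
	      if (canAccess == 1) then
	         ! let the current device (idev) read jdev's memory directly
	         ierr = cudaDeviceEnablePeerAccess(jdev, 0)
	      endif
	   enddo
	enddo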

I typically recommend using MPI+CUDA Fortran when doing multi-GPU programming. The logic is much simpler than trying to maintain an array of data containers, one element per device, as you do here.
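
The usual pattern is one MPI rank per GPU. A minimal skeleton of that layout (the rank-to-device mapping here is simplified to rank modulo device count) would be:

	program mpi_cuf
	   use mpi
	   use cudafor
	   implicit none
	   integer :: ierr, istat, rank, nranks, ndev

	   call MPI_Init(ierr)
	   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
	   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

	   ! bind this rank to one device; each rank then owns ordinary
	   ! device arrays and runs the same CUF kernel on its own sub-domain
	   istat = cudaGetDeviceCount(ndev)
	   istat = cudaSetDevice(mod(rank, ndev))

	   ! ... allocate den_dev/rhou_dev as plain device arrays here ...

	   call MPI_Finalize(ierr)
	end program mpi_cuf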

-Mat

Hi mkcog,

Thank you very much for your answer. Your suggestion works.

Cofludy