Three Dimensional Matrices

Hi everyone!

I am currently exploring how to use three dimensional matrices in the kernel but it is not as simple as I thought. According to the CUDA Fortran
Programming Guide and Reference the values of blockidx%z and griddim%z must always be one. Keeping constraints in mind, I know a block cannot have more than 512 threads. If the device will only accept two dimensional matrices, what would be the best way to take a three dimensional matrix and send it to the kernel?

This is the code I am trying to implement:

do m = 1, mba
   do j = 2, nj-1
      do i = 2, ni-1

         PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
                         + AS(i,j,m) * PHI(i,j-1,m)&
                         + AE(i,j,m) * PHI(i,j+1,m)&
                         + AW(i,j,m) * PHI(i,j-1,m)&
                         + AP(i,j,m) * PHI(i,j+1,m)
      enddo
   enddo
enddo

Any feedback would be helpful since I am new to CUDA Fortran. Thank you for your time!

-Chris

edit: used your variable names

I’m fairly new to cuda fortran as well, so take my advice with a grain of salt. However, would it be possible to declare a one dimensional dimGrid and a two dimensional dimBlock? You could then reference your i,j,k variables with:

i = threadidx%x
j= threadidx%y
m= blockidx%x

Or at least something similar. I’m planning on doing something similar for the project I’m working on. The way I see it, you’re taking your 3D block and dividing it up into slices of 2D planes, referenced by ‘m’. Each cuda block handles one slice, while your 2D array of threads handles each element of the plane, referenced by ‘i’ and ‘j’.

I know this isn’t the same approach the PGI guide takes in its example, so I’m right here with you wondering if this is a legit method.

Thank you!
I will work on it and see what constraints the device gives me.

Sincerely,

Chris

Definitely let me know how it goes. Keep in mind with that technique the max (square) slice size is 22x22 to keep within the 512 thread limit.

Hi Chris,

Assuming “ni” is large, I would make your “m” loop be correspond to “blockid%x” and “j” to “blockid%y”. Next have your kernel execute multiple iterations of i. Something like:

 
m  = blockIdx%x-1
j  =  blockIdx%y  ! adjust for the starting value of 2 
tidx = threadIdx%x+1  ! adjust for the starting value of 2
nthrds = blockDim%x 

do i = tidx, ni-1, nthrds
         PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
                         + AS(i,j,m) * PHI(i,j-1,m)&
                         + AE(i,j,m) * PHI(i,j+1,m)&
                         + AW(i,j,m) * PHI(i,j-1,m)&
                         + AP(i,j,m) * PHI(i,j+1,m)
      enddo
   enddo
enddo

And your kernel launch would be something like:

    
    type(dim3) :: dimGrid, dimBlock
...
   dimBlock = dim3(512,1,1)   
   dimGrid = dim3(mba,nj-2,1)
   call foo<<dimGrid,dimBlock>>>(arg1)

Hope this helps,
Mat

Hi Mat,

One of my main problems is actually sending a three-dimensional matrix to the device. The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

Currently I am modeling the program based on the matrix multiplication sample code from earlier. In the sample code, the two-dimensional matrices were allocated in the device and data from the host was copied to them before sending them to the kernel.

!Bits and pieces from the sample code:

 real, device, allocatable, dimension(:,:) :: Adev,Bdev,Cdev
 Adev = A(1:N,1:M)
 allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
 call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )

call foo<<dimGrid,dimBlock>>>(arg1)

How should I send the data to the device?
For example, should I send a matrix called a_device(N,N,N) where you wrote arg1?

Thank you for helping me, you are very helpful.

-Chris

Hi Chris,

The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

What’s the actual error that you’re seeing?

In your sample, you allocate after trying to copy the data over the the GPU. The allocate must come before the copy else there’s no place to copy the data to. Is this error you’re getting?

  • Mat

Hi Mat,

Here is the code I am using to explore 3D Matrices. I copy the data after the allocation to the device and I receive the error:

ConfigureCall FAILED:9

Unless I comment the kernel call out. This code is not mine, it is actually TheMatt’s but his version is for exploring two-dimensional matrices.

module assign_mod
   use cudafor
contains
   attributes(global) subroutine assign_kernel(A, N)
      implicit none

      integer, value :: N
      integer, device, dimension(N,N,N) :: A
      integer, device :: idx, idy
      integer, device :: bx, by, bdimx, bdimy
      integer, device :: tx, ty

      bx = blockidx%x
      bdimx = blockdim%x
      tx = threadidx%x

      by = blockidx%y
      bdimy = blockdim%y
      ty = threadidx%y

     ! bz = blockidx%z
     ! bdimz = blockdim%z
     ! tz = threadidx%z

      idx = (bx-1) * bdimx + tx
      idy = (by-1) * bdimy + ty
     ! idz = (bz-1) * bdimz + tz

      if (idx <= N .and. idy <= N) then
         A(idx,idy,1) = tx
      end if

   end subroutine assign_kernel
end module assign_mod

program main
   use cudafor
   use assign_mod
   implicit none

   integer :: i, j, k
   integer, parameter :: n = 6
   integer, parameter :: blocksize = 2

   integer, allocatable, dimension(:,:,:) :: a_host, b_host
   integer, device, allocatable, dimension(:,:,:) :: a_device

   type(dim3) :: dimGrid, dimBlock

   dimBlock = dim3(blocksize,blocksize,blocksize)
   dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)
   allocate(a_host(n,n,n))
   allocate(b_host(n,n,n))

   allocate(a_device(n,n,n))           !matrix allocated in device

   forall (i=1:n,j=1:n,k=1:n)
      a_host(i,j,k) = 999
   end forall

   a_device = a_host

   call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

    b_host = a_device


   !Show contents of the device array stored in b_host
   do k = 1, n
    do i = 1, n
      do j = 1, n                                          
          if (mod(j,n) == 0) then
           write (*,"(I3)") b_host(i,j,k)                   !Writes Vertically
          else
           write (*,"(I3, 2X)",advance="no") b_host(i,j,k)   !Writes Horizontally
          end if
            !write (*,*) b_host(i,j)
            !write (*,*) j
      end do
    end do
   end do
end program main

Thanks again!

-Chris

Hi Chris,

The third dimension of your grid needs to be 1.

Change:
dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)

To:
dimGrid = dim3(n/blocksize,n/blocksize,1)

Hope this helps,
Mat

Hi Mat,

Sorry for the late response. Yes, I was able to solve the problem and I know how to place three-dimensional matrices in the device. Because I cannot use a threadidx%z or blockidx%z I ended up using only threadidx%x, blockidx%x and blockidy%y. Thank you for you help!

Sincerely,

-Chris