Three Dimensional Matrices

rotteweiler · July 1, 2010, 2:41pm

Hi everyone!

I am currently exploring how to use three dimensional matrices in the kernel but it is not as simple as I thought. According to the CUDA Fortran
Programming Guide and Reference the values of blockidx%z and griddim%z must always be one. Keeping constraints in mind, I know a block cannot have more than 512 threads. If the device will only accept two dimensional matrices, what would be the best way to take a three dimensional matrix and send it to the kernel?

This is the code I am trying to implement:

do m = 1, mba
   do j = 2, nj-1
      do i = 2, ni-1

         PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
                         + AS(i,j,m) * PHI(i,j-1,m)&
                         + AE(i,j,m) * PHI(i,j+1,m)&
                         + AW(i,j,m) * PHI(i,j-1,m)&
                         + AP(i,j,m) * PHI(i,j+1,m)
      enddo
   enddo
enddo

Any feedback would be helpful since I am new to CUDA Fortran. Thank you for your time!

-Chris

rmsivley · July 1, 2010, 3:10pm

edit: used your variable names

I’m fairly new to cuda fortran as well, so take my advice with a grain of salt. However, would it be possible to declare a one dimensional dimGrid and a two dimensional dimBlock? You could then reference your i,j,k variables with:

i = threadidx%x
j= threadidx%y
m= blockidx%x

Or at least something similar. I’m planning on doing something similar for the project I’m working on. The way I see it, you’re taking your 3D block and dividing it up into slices of 2D planes, referenced by ‘m’. Each cuda block handles one slice, while your 2D array of threads handles each element of the plane, referenced by ‘i’ and ‘j’.

I know this isn’t the same approach the PGI guide takes in its example, so I’m right here with you wondering if this is a legit method.

rotteweiler · July 1, 2010, 4:03pm

Thank you!
I will work on it and see what constraints the device gives me.

Sincerely,

Chris

rmsivley · July 1, 2010, 4:13pm

Definitely let me know how it goes. Keep in mind with that technique the max (square) slice size is 22x22 to keep within the 512 thread limit.

MatColgrove · July 1, 2010, 10:24pm

Hi Chris,

Assuming “ni” is large, I would make your “m” loop be correspond to “blockid%x” and “j” to “blockid%y”. Next have your kernel execute multiple iterations of i. Something like:

 
m  = blockIdx%x-1
j  =  blockIdx%y  ! adjust for the starting value of 2 
tidx = threadIdx%x+1  ! adjust for the starting value of 2
nthrds = blockDim%x 

do i = tidx, ni-1, nthrds
         PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
                         + AS(i,j,m) * PHI(i,j-1,m)&
                         + AE(i,j,m) * PHI(i,j+1,m)&
                         + AW(i,j,m) * PHI(i,j-1,m)&
                         + AP(i,j,m) * PHI(i,j+1,m)
      enddo
   enddo
enddo

And your kernel launch would be something like:

    
    type(dim3) :: dimGrid, dimBlock
...
   dimBlock = dim3(512,1,1)   
   dimGrid = dim3(mba,nj-2,1)
   call foo<<dimGrid,dimBlock>>>(arg1)

Hope this helps,
Mat

rotteweiler · July 2, 2010, 3:51pm

Hi Mat,

One of my main problems is actually sending a three-dimensional matrix to the device. The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

Currently I am modeling the program based on the matrix multiplication sample code from earlier. In the sample code, the two-dimensional matrices were allocated in the device and data from the host was copied to them before sending them to the kernel.

!Bits and pieces from the sample code:

 real, device, allocatable, dimension(:,:) :: Adev,Bdev,Cdev
 Adev = A(1:N,1:M)
 allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
 call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )

call foo<<dimGrid,dimBlock>>>(arg1)

How should I send the data to the device?
For example, should I send a matrix called a_device(N,N,N) where you wrote arg1?

Thank you for helping me, you are very helpful.

-Chris

MatColgrove · July 2, 2010, 7:18pm

Hi Chris,

The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

What’s the actual error that you’re seeing?

In your sample, you allocate after trying to copy the data over the the GPU. The allocate must come before the copy else there’s no place to copy the data to. Is this error you’re getting?

Mat

rotteweiler · July 6, 2010, 2:37pm

Hi Mat,

Here is the code I am using to explore 3D Matrices. I copy the data after the allocation to the device and I receive the error:

ConfigureCall FAILED:9

Unless I comment the kernel call out. This code is not mine, it is actually TheMatt’s but his version is for exploring two-dimensional matrices.

module assign_mod
   use cudafor
contains
   attributes(global) subroutine assign_kernel(A, N)
      implicit none

      integer, value :: N
      integer, device, dimension(N,N,N) :: A
      integer, device :: idx, idy
      integer, device :: bx, by, bdimx, bdimy
      integer, device :: tx, ty

      bx = blockidx%x
      bdimx = blockdim%x
      tx = threadidx%x

      by = blockidx%y
      bdimy = blockdim%y
      ty = threadidx%y

     ! bz = blockidx%z
     ! bdimz = blockdim%z
     ! tz = threadidx%z

      idx = (bx-1) * bdimx + tx
      idy = (by-1) * bdimy + ty
     ! idz = (bz-1) * bdimz + tz

      if (idx <= N .and. idy <= N) then
         A(idx,idy,1) = tx
      end if

   end subroutine assign_kernel
end module assign_mod

program main
   use cudafor
   use assign_mod
   implicit none

   integer :: i, j, k
   integer, parameter :: n = 6
   integer, parameter :: blocksize = 2

   integer, allocatable, dimension(:,:,:) :: a_host, b_host
   integer, device, allocatable, dimension(:,:,:) :: a_device

   type(dim3) :: dimGrid, dimBlock

   dimBlock = dim3(blocksize,blocksize,blocksize)
   dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)
   allocate(a_host(n,n,n))
   allocate(b_host(n,n,n))

   allocate(a_device(n,n,n))           !matrix allocated in device

   forall (i=1:n,j=1:n,k=1:n)
      a_host(i,j,k) = 999
   end forall

   a_device = a_host

   call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

    b_host = a_device


   !Show contents of the device array stored in b_host
   do k = 1, n
    do i = 1, n
      do j = 1, n                                          
          if (mod(j,n) == 0) then
           write (*,"(I3)") b_host(i,j,k)                   !Writes Vertically
          else
           write (*,"(I3, 2X)",advance="no") b_host(i,j,k)   !Writes Horizontally
          end if
            !write (*,*) b_host(i,j)
            !write (*,*) j
      end do
    end do
   end do
end program main

Thanks again!

-Chris

MatColgrove · July 6, 2010, 4:08pm

Hi Chris,

The third dimension of your grid needs to be 1.

Change:
dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)

To:
dimGrid = dim3(n/blocksize,n/blocksize,1)

Hope this helps,
Mat

rotteweiler · July 22, 2010, 4:46pm

Hi Mat,

Sorry for the late response. Yes, I was able to solve the problem and I know how to place three-dimensional matrices in the device. Because I cannot use a threadidx%z or blockidx%z I ended up using only threadidx%x, blockidx%x and blockidy%y. Thank you for you help!

Sincerely,

-Chris

Topic		Replies	Views
Dimensions of a Block and a Grid CUDA Programming and Performance	7	13026	May 1, 2008
declare dim3 type using %x,%y doesn't work Legacy PGI Compilers	1	3083	January 10, 2011
3D grids CUDA Programming and Performance	13	10096	June 29, 2009
3D matrix and 3D threads/blocks problem CUDA Programming and Performance	4	7216	June 16, 2011
Blocking of 3D Matrices! CUDA Programming and Performance	1	1035	October 23, 2008
3D threads index How do you index 3D threads?!!! CUDA Programming and Performance	3	1067	April 2, 2010
How to define a three-dimensional array? define a three-dimensional array on GPU CUDA Programming and Performance	13	11854	October 10, 2008
How to allocate a 3d array such that you can use the indecies to access its elements CUDA Programming and Performance	20	5415	October 24, 2009
problem with 3 dimensional thread block with a three dimensional grid, the kernel was not executed CUDA Programming and Performance	6	5981	February 10, 2011
3D grids Legacy PGI Compilers	4	3193	July 6, 2012

Three Dimensional Matrices

Related topics