# Three Dimensional Matrices

Hi everyone!

I am currently exploring how to use three dimensional matrices in the kernel but it is not as simple as I thought. According to the CUDA Fortran
Programming Guide and Reference the values of blockidx%z and griddim%z must always be one. Keeping constraints in mind, I know a block cannot have more than 512 threads. If the device will only accept two dimensional matrices, what would be the best way to take a three dimensional matrix and send it to the kernel?

This is the code I am trying to implement:

``````do m = 1, mba
do j = 2, nj-1
do i = 2, ni-1

PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
+ AS(i,j,m) * PHI(i,j-1,m)&
+ AE(i,j,m) * PHI(i,j+1,m)&
+ AW(i,j,m) * PHI(i,j-1,m)&
+ AP(i,j,m) * PHI(i,j+1,m)
enddo
enddo
enddo
``````

Any feedback would be helpful since I am new to CUDA Fortran. Thank you for your time!

-Chris

I’m fairly new to cuda fortran as well, so take my advice with a grain of salt. However, would it be possible to declare a one dimensional dimGrid and a two dimensional dimBlock? You could then reference your i,j,k variables with:

m= blockidx%x

Or at least something similar. I’m planning on doing something similar for the project I’m working on. The way I see it, you’re taking your 3D block and dividing it up into slices of 2D planes, referenced by ‘m’. Each cuda block handles one slice, while your 2D array of threads handles each element of the plane, referenced by ‘i’ and ‘j’.

I know this isn’t the same approach the PGI guide takes in its example, so I’m right here with you wondering if this is a legit method.

Thank you!
I will work on it and see what constraints the device gives me.

Sincerely,

Chris

Definitely let me know how it goes. Keep in mind with that technique the max (square) slice size is 22x22 to keep within the 512 thread limit.

Hi Chris,

Assuming “ni” is large, I would make your “m” loop be correspond to “blockid%x” and “j” to “blockid%y”. Next have your kernel execute multiple iterations of i. Something like:

``````
m  = blockIdx%x-1
j  =  blockIdx%y  ! adjust for the starting value of 2
nthrds = blockDim%x

do i = tidx, ni-1, nthrds
PHIN(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
+ AS(i,j,m) * PHI(i,j-1,m)&
+ AE(i,j,m) * PHI(i,j+1,m)&
+ AW(i,j,m) * PHI(i,j-1,m)&
+ AP(i,j,m) * PHI(i,j+1,m)
enddo
enddo
enddo
``````

And your kernel launch would be something like:

``````
type(dim3) :: dimGrid, dimBlock
...
dimBlock = dim3(512,1,1)
dimGrid = dim3(mba,nj-2,1)
call foo<<dimGrid,dimBlock>>>(arg1)
``````

Hope this helps,
Mat

Hi Mat,

One of my main problems is actually sending a three-dimensional matrix to the device. The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

Currently I am modeling the program based on the matrix multiplication sample code from earlier. In the sample code, the two-dimensional matrices were allocated in the device and data from the host was copied to them before sending them to the kernel.

``````!Bits and pieces from the sample code:

real, device, allocatable, dimension(:,:) :: Adev,Bdev,Cdev
call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )

call foo<<dimGrid,dimBlock>>>(arg1)
``````

How should I send the data to the device?
For example, should I send a matrix called a_device(N,N,N) where you wrote arg1?

Thank you for helping me, you are very helpful.

-Chris

Hi Chris,

The GPU does not seem to like it when I allocate the matrix, no matter how small it is.

What’s the actual error that you’re seeing?

In your sample, you allocate after trying to copy the data over the the GPU. The allocate must come before the copy else there’s no place to copy the data to. Is this error you’re getting?

• Mat

Hi Mat,

Here is the code I am using to explore 3D Matrices. I copy the data after the allocation to the device and I receive the error:

ConfigureCall FAILED:9

Unless I comment the kernel call out. This code is not mine, it is actually TheMatt’s but his version is for exploring two-dimensional matrices.

``````module assign_mod
use cudafor
contains
attributes(global) subroutine assign_kernel(A, N)
implicit none

integer, value :: N
integer, device, dimension(N,N,N) :: A
integer, device :: idx, idy
integer, device :: bx, by, bdimx, bdimy
integer, device :: tx, ty

bx = blockidx%x
bdimx = blockdim%x

by = blockidx%y
bdimy = blockdim%y

! bz = blockidx%z
! bdimz = blockdim%z

idx = (bx-1) * bdimx + tx
idy = (by-1) * bdimy + ty
! idz = (bz-1) * bdimz + tz

if (idx <= N .and. idy <= N) then
A(idx,idy,1) = tx
end if

end subroutine assign_kernel
end module assign_mod

program main
use cudafor
use assign_mod
implicit none

integer :: i, j, k
integer, parameter :: n = 6
integer, parameter :: blocksize = 2

integer, allocatable, dimension(:,:,:) :: a_host, b_host
integer, device, allocatable, dimension(:,:,:) :: a_device

type(dim3) :: dimGrid, dimBlock

dimBlock = dim3(blocksize,blocksize,blocksize)
dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)
allocate(a_host(n,n,n))
allocate(b_host(n,n,n))

allocate(a_device(n,n,n))           !matrix allocated in device

forall (i=1:n,j=1:n,k=1:n)
a_host(i,j,k) = 999
end forall

a_device = a_host

call assign_kernel<<<dimGrid, dimBlock>>> (a_device, n)

b_host = a_device

!Show contents of the device array stored in b_host
do k = 1, n
do i = 1, n
do j = 1, n
if (mod(j,n) == 0) then
write (*,"(I3)") b_host(i,j,k)                   !Writes Vertically
else
write (*,"(I3, 2X)",advance="no") b_host(i,j,k)   !Writes Horizontally
end if
!write (*,*) b_host(i,j)
!write (*,*) j
end do
end do
end do
end program main
``````

Thanks again!

-Chris

Hi Chris,

The third dimension of your grid needs to be 1.

Change:
dimGrid = dim3(n/blocksize,n/blocksize,n/blocksize)

To:
dimGrid = dim3(n/blocksize,n/blocksize,1)

Hope this helps,
Mat

Hi Mat,

Sorry for the late response. Yes, I was able to solve the problem and I know how to place three-dimensional matrices in the device. Because I cannot use a threadidx%z or blockidx%z I ended up using only threadidx%x, blockidx%x and blockidy%y. Thank you for you help!

Sincerely,

-Chris