Finite Difference Methods in CUDA Fortran, Part 2

Originally published at: https://developer.nvidia.com/blog/finite-difference-methods-cuda-fortran-part-2/

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we…

Hi, I couldn't understand why you chose those block dimensions. What does perfect memory coalescing mean? Thanks.

I suggest you read this: https://devblogs.nvidia...

Thanks!

Now, why is the instruction below necessary?

do j = threadIdx%y, my, blockDim%y
   f_s(i_l,j) = f(i,j,k)
enddo

What's the difference between this and the following?

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

I created a global memory version (code at the end of this comment).

The results using double precision are:

Using shared memory tile of x-y:64x4
RMS error: 0.3380241952306925
MAX error: 1.852001381073753
Average time (ms): 6.2668800E-02
Average Bandwidth (GB/s): 66.92810

Using global memory tile of x-y:64x4
RMS error: 0.3380245069407786
MAX error: 1.852001381073753
Average time (ms): 5.1046401E-02
Average Bandwidth (GB/s): 82.16650

Why are the RMS errors different between these versions?

attributes(global) subroutine deriv_x_global(f,df)
  implicit none
  real(fp_kind), intent(in)  :: f(mx,my,mz)
  real(fp_kind), intent(out) :: df(mx,my,mz)
  integer :: i, j, k

  ! with a 64x4 thread block, blockIdx%x tiles the y dimension
  ! and blockIdx%y spans the z dimension
  i = threadIdx%x
  j = (blockIdx%x-1)*blockDim%y + threadIdx%y
  k = blockIdx%y

  if (i>3 .and. i<mx-3) then
     (...)
  end if
end subroutine deriv_x_global
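
The "(...)" elides the stencil body. As a minimal sketch of what it presumably contains, assuming the constant-memory derivative coefficients ax_c, bx_c, cx_c, dx_c from the blog series (names assumed here, not taken from the comment):

! hypothetical 8th-order central difference applied directly to global memory
df(i,j,k) = ax_c*( f(i+1,j,k) - f(i-1,j,k) ) &
          + bx_c*( f(i+2,j,k) - f(i-2,j,k) ) &
          + cx_c*( f(i+3,j,k) - f(i-3,j,k) ) &
          + dx_c*( f(i+4,j,k) - f(i-4,j,k) )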

The do loop is required because the code uses a 32x8x1 thread block to load a 32x64x1 tile of data into shared memory, so each thread loads 8 elements of the tile.
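
A minimal sketch of that strided load, assuming blockDim = (32,8,1), my = 64, and a shared tile declared roughly as real(fp_kind), shared :: f_s(32,my), as in the post:

i_l = threadIdx%x                       ! local x index within the tile
i   = (blockIdx%x-1)*blockDim%x + i_l   ! global x index
k   = blockIdx%y                        ! one z-plane per block
do j = threadIdx%y, my, blockDim%y      ! j = ty, ty+8, ..., so 8 trips
   f_s(i_l,j) = f(i,j,k)                ! each thread loads 8 tile elements
enddo
call syncthreads()                      ! finish all loads before reading the tile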

The statements:

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

are used when there is a one-to-one mapping of threads to elements in the tile.
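
For comparison, a minimal one-to-one sketch, assuming a (64,4,1) thread block that exactly covers a 64x4 tile, roughly as in the x-derivative kernel from the series (the f_s declaration is assumed):

i   = threadIdx%x                       ! block spans the full x extent
j_l = threadIdx%y                       ! local y index within the tile
j   = (blockIdx%x-1)*blockDim%y + j_l   ! global y index
k   = blockIdx%y
f_s(i,j_l) = f(i,j,k)                   ! exactly one element per thread
call syncthreads()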