Originally published at: https://developer.nvidia.com/blog/finite-difference-methods-cuda-fortran-part-2/

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we…

Hi, I couldn't understand why you chose those block dimensions. What does perfect memory coalescing mean? Thanks.

Thanks!

Now, why is the instruction below necessary?

do j = threadIdx%y, my, blockDim%y
   f_s(i_l,j) = f(i,j,k)
enddo

What's the difference between this and:

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

?

I created a global memory version (kernel shown below).

The results using double precision are:

Using shared memory tile of x-y: 64x4
  RMS error:                0.3380241952306925
  MAX error:                1.852001381073753
  Average time (ms):        6.2668800E-02
  Average Bandwidth (GB/s): 66.92810

Using global memory tile of x-y: 64x4
  RMS error:                0.3380245069407786
  MAX error:                1.852001381073753
  Average time (ms):        5.1046401E-02
  Average Bandwidth (GB/s): 82.16650

Why are the RMS errors different between these versions?

attributes(global) subroutine deriv_x_global(f,df)
  implicit none
  real(fp_kind), intent(in)  :: f(mx,my,mz)
  real(fp_kind), intent(out) :: df(mx,my,mz)
  integer :: i, j, k

  i = threadIdx%x
  j = (blockIdx%x-1)*blockDim%y + threadIdx%y
  k = blockIdx%y

  if (i>3 .and. i<mx-3) then
     (...)
  end if

The do loop is required because the code uses a 32x8x1 thread block to load a 32x64x1 tile of data into shared memory, so each thread loads 8 elements of the tile.

The statements:

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

are used when there is a one-to-one mapping of threads to elements in the tile.
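To make the contrast concrete, here is a minimal sketch of the strided load, assuming the 32x8x1 thread block and 32x64x1 shared-memory tile mentioned above (fp_kind, mx, my, mz as in the blog's module; the subroutine name and the f_s dimensions are illustrative, not the blog's exact code):

```fortran
attributes(global) subroutine load_tile_strided(f)
  implicit none
  real(fp_kind), intent(in) :: f(mx,my,mz)
  real(fp_kind), shared :: f_s(32,my)   ! one 32-wide x-y slab of the tile
  integer :: i, j, k, i_l

  i   = (blockIdx%x-1)*blockDim%x + threadIdx%x
  i_l = threadIdx%x
  k   = blockIdx%y

  ! 32x8 threads cover the 32x64 tile in my/blockDim%y = 8 strided trips.
  ! With the one-to-one mapping
  !   j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  ! each thread loads exactly one element, so a single 32x8 block could
  ! fill only 8 of the 64 tile rows -- hence the loop.
  do j = threadIdx%y, my, blockDim%y
     f_s(i_l,j) = f(i,j,k)
  enddo
  call syncthreads()
end subroutine load_tile_strided
```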