Originally published at: https://developer.nvidia.com/blog/finite-difference-methods-cuda-fortran-part-2/

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran. In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we…

Hi, I couldn't understand why you chose those block dimensions. What does perfect memory coalescing mean? Thanks.

Thanks!

Now, why is the instruction below necessary?

do j = threadIdx%y, my, blockDim%y
   f_s(i_l,j) = f(i,j,k)
enddo

What's the difference between this and:

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

?

I created a global memory version (kernel shown below).

The results using double precision are:

Using shared memory tile of x-y: 64x4
  RMS error:                0.3380241952306925
  MAX error:                1.852001381073753
  Average time (ms):        6.2668800E-02
  Average Bandwidth (GB/s): 66.92810

Using global memory tile of x-y: 64x4
  RMS error:                0.3380245069407786
  MAX error:                1.852001381073753
  Average time (ms):        5.1046401E-02
  Average Bandwidth (GB/s): 82.16650

Why are the RMS errors different between these versions?

attributes(global) subroutine deriv_x_global(f,df)
  implicit none
  real(fp_kind), intent(in)  :: f(mx,my,mz)
  real(fp_kind), intent(out) :: df(mx,my,mz)
  integer :: i, j, k

  i = threadIdx%x
  j = (blockIdx%x-1)*blockDim%y + threadIdx%y
  k = blockIdx%y

  if (i>3 .and. i<mx-3) then
     (...)
  end if

The do loop is required because the code uses a 32x8x1 thread block to load a 32x64x1 tile of data into shared memory, so each thread loads 8 elements of the tile.

The statements:

j = (blockIdx%y-1)*blockDim%y + threadIdx%y
f_s(i_l,j) = f(i,j,k)

are used when there is a one-to-one mapping of threads to elements in the tile.
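To make the contrast concrete, here is a minimal sketch of the strided load, assuming the 32x8x1 thread block and 32x64x1 shared-memory tile mentioned above (fp_kind, mx, my, mz as in the blog's module; the subroutine name and the f_s dimensions are illustrative, not the blog's exact code):

```fortran
attributes(global) subroutine load_tile_strided(f)
  implicit none
  real(fp_kind), intent(in) :: f(mx,my,mz)
  real(fp_kind), shared :: f_s(32,my)   ! one 32-wide x-y slab of the tile
  integer :: i, j, k, i_l

  i   = (blockIdx%x-1)*blockDim%x + threadIdx%x
  i_l = threadIdx%x
  k   = blockIdx%y

  ! 32x8 threads cover the 32x64 tile in my/blockDim%y = 8 strided trips.
  ! With the one-to-one mapping
  !   j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  ! each thread loads exactly one element, so a single 32x8 block could
  ! fill only 8 of the 64 tile rows -- hence the loop.
  do j = threadIdx%y, my, blockDim%y
     f_s(i_l,j) = f(i,j,k)
  enddo
  call syncthreads()
end subroutine load_tile_strided
```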