Finite Difference Methods in CUDA C/C++, Part 1

Originally published at:

In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today’s post is to show how to use shared memory to enhance data reuse…

My experiments show that with sPencils=1 the results are better. Besides, for meshes larger than 64^3, for example 1024^3, shared memory is not large enough to support the tile method, and it becomes very difficult to design the block size and indexing.

Can you provide more details on the hardware you're using?

My program ran on a Quadro K4000 card.

Would it be more or less efficient if we calculated all the derivatives in the same kernel with smaller (say 16 x 16) blocks? How does data reuse compare with coalesced memory access, which, to be fair, may not be perfect in practical cases?

Arbitrarily (large) sized grids: naively, I changed the grid size (originally 64^3, i.e. mx=my=mz=64) to 92^3 (i.e. mx=my=mz=92), and for anything above 92 I obtain a Segmentation fault (core dumped). I was simply curious what was happening; is it a limitation of the GPU hardware? If so, which parameter? I'm on an NVIDIA GeForce GTX 980 Ti.

The program won't let you set mx/my/mz to 92 because it is not a whole multiple of 32; it will let you set it to 96. When I do this on a Tesla M40 (same compute capability as your 980 Ti) I get no errors. When I try it on my older 750M (in my laptop), I get a segfault, and I'm having trouble finding the root cause. What version of CUDA are you using?

I think I figured it out. It looks like it was a stack overflow caused by the arrays f, sol, etc. in the function runTest. I didn't get any error, but when they were too large, it would crash as soon as the function was called. I checked in a fix: I changed the code to allocate these three arrays on the heap with new instead of making them stack arrays.

Let me know if this lets you use larger sizes.

@Mark_Harris Thanks for looking into this.

With your latest reply, I'm assuming it's on the github repository? I'll go there right now.

Just for completeness, regarding my second-to-last reply: once I was obtaining segmentation faults for multiples of 32, e.g. 128^3, I made a copy of the code, commented out the "long pencils" (lPencils) derivatives, and tested only the "small pencils" derivatives to see where they fail (small pencils, unless I'm wrong, only need multiples of 4 for the tile sizes), and "got up to" 92.

Here's the device name printout and compute capability; I can print out more info (i.e. from cudaGetDeviceProperties) if it may help in the next reply:

Device name: GeForce GTX 980 Ti
Compute Capability: 5.2

Segmentation fault (core dumped)

I'm on CUDA ToolKit 7.5

P.S. I became very interested in your post because I want to implement the 3-dim. Navier-Stokes equations with finite-difference methods in CUDA C/C++, and obviously derivatives are needed for the gradient and divergence. Has this already been done? (And is it on GitHub?) I've Google- and GitHub-searched as much as I could already.

I made a copy and changed the grid size to 352^3. For any choice above 352 (e.g. mx=384 with my=mz=352, or all three dimensions), I obtain a compilation error (just from running nvcc) for the direction with too large a grid dimension:

ptxas error : Entry function '_Z21derivative_x_lPencilsPfS_' uses too much shared data (0xc400 bytes, 0xc000 max)

Is this because of an inherent limitation of the GTX 980 Ti itself? (I can call cudaGetDeviceProperties and print out the max threads per dimension.) Or is there something inherently limited about shared data per block, so that this method can't scale beyond the GPU's shared data per block? For instance, I can do a naive implementation of the derivatives in __global__ memory with 1920 threads along one dimension (1920x1920x32 total threads), 16x16x4 threads per block; any much "bigger" kernel launch results in a segmentation fault. Using the pencil method as an improvement on the "naive" shared-memory method (i.e. tiling per block, with the 2 "dark orange" halo regions, as in the 4x16 tile in the very first example of the article above), I believe you'd need a 1920x4 tile, which may not work if I can only compute the derivative on a 352x352x352 grid (note 352^3 < 1920*1920*32).

You need to modify the code to decouple your total domain size from the tile size. The tile size is limited by available shared memory (see the post on shared memory). But there's no reason you can't do domains that are much larger (limited by device memory, or system memory if you use Unified Memory).

I don't understand what it means to "decouple" the total domain size from the tile size. Let's take a concrete example: the dimensions of the desired grid are (mx, my, mz). The size of a tile, as above in this article, is mx*sPencils (for the x-direction case). This seems to be the heart of the "pencils" scheme: choosing the size of the tile in the direction of the derivative to be the entire mx (for the x-direction). This scheme is unlikely to scale at all; look at the grid and block dimensions you're launching for each of the kernel calls derivative_x (and derivative_y, derivative_z):

(my/sPencils, mz, 1) for the grid dimensions (number of blocks in each dimension) and
(mx, sPencils, 1) for the block dimensions (number of threads in each (thread) block)

Then you'd have mx*sPencils*1 threads to launch in each block, which is severely limited by the maximum number of threads per block.

If I call cudaGetDeviceProperties (or run the code off my GitHub repository) on the GTX 980 Ti, I get 1024 max threads per block.

However, the number of blocks that can be launched on the grid is much higher (max grid dimensions: (2147483647, 65535, 65535)), and so a possible way to "save" the pencil method would be to try to launch more blocks.
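For completeness, a minimal query of those limits (standard CUDA runtime calls; it prints whatever the installed device reports):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // device 0
  printf("Device name: %s\n", prop.name);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Max block dims: (%d, %d, %d)\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("Max grid dims: (%d, %d, %d)\n",
         prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
  return 0;
}
```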

Also note that in this code we're demanding __shared__ float s_f[sPencils][mx+8], an array of floats of size sPencils*(mx+8), and if we want our grid to scale to values larger than 64, say mx = 1920, then we'd place severe demands on the shared memory. I'm not sure the domain size, mx in this case, can be decoupled from the tile size, which I think is (correct me if I'm wrong) this __shared__ 2-dim. array, which depends directly on mx, without throwing out the idea of pencils entirely.

I don't know how to specifically modify the example here ( to scale beyond a grid of 354^3; I suspect this is the maximum because of the max threads per block (1024) for the GTX 980 Ti. I've tried changing the sizes involved, and I also don't understand how to decouple this domain size from the tile size without throwing out the entire idea of pencils.

I put a write-up of this post here, because, at least to me, writing in LaTeX is much clearer:
pp. 30, or the section "Note on finite-difference methods on the shared memory of the device GPU, in particular, the pencil method, that attempts to improve upon the double loading of boundary "halo" cells (of the grid)."

Threads can execute loops, so you can loop and process multiple tiles of the entire domain using a single thread block; don't try to fit the entire domain in shared memory at once. That's what I mean by decouple.
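A sketch of what that could look like (my illustration, not code from the post: TILE_X, the periodic halo indexing, and the coefficient names c_ax etc. are assumptions modeled on the post's sPencils kernel, and mx is assumed to be a multiple of TILE_X):

```cuda
#define sPencils 4   // pencils per tile, as in the original code
#define TILE_X  64   // fixed tile width: no longer tied to mx
#define RAD      4   // radius of the 8th-order stencil

__constant__ float c_ax, c_bx, c_cx, c_dx;  // stencil coefficients

// Launch with block = (TILE_X, sPencils, 1), grid = (my/sPencils, mz, 1).
__global__ void derivative_x_tiled(const float *f, float *df,
                                   int mx, int my, int mz)
{
  // Tile storage is fixed at TILE_X + 2*RAD floats per pencil,
  // no matter how large mx grows.
  __shared__ float s_f[sPencils][TILE_X + 2 * RAD];

  int j  = blockIdx.x * blockDim.y + threadIdx.y;  // this pencil's y index
  int k  = blockIdx.y;                             // z index
  int sj = threadIdx.y;
  size_t base = (size_t)k * mx * my + (size_t)j * mx;

  // The loop is the decoupling: the block marches a fixed tile along x.
  for (int x0 = 0; x0 < mx; x0 += TILE_X) {
    int i  = x0 + threadIdx.x;   // global x index of this thread's point
    int si = threadIdx.x + RAD;  // its slot in shared memory

    s_f[sj][si] = f[base + i];

    // Halo cells come from the neighboring tiles (periodic, as in the post).
    if (threadIdx.x < RAD) {
      int il = (i - RAD + mx) % mx;
      int ir = (x0 + TILE_X + threadIdx.x) % mx;
      s_f[sj][threadIdx.x]                = f[base + il];
      s_f[sj][TILE_X + RAD + threadIdx.x] = f[base + ir];
    }
    __syncthreads();

    df[base + i] =
        c_ax * (s_f[sj][si + 1] - s_f[sj][si - 1])
      + c_bx * (s_f[sj][si + 2] - s_f[sj][si - 2])
      + c_cx * (s_f[sj][si + 3] - s_f[sj][si - 3])
      + c_dx * (s_f[sj][si + 4] - s_f[sj][si - 4]);

    __syncthreads();  // keep s_f intact until every pencil is done reading
  }
}
```

With this structure the shared-memory footprint is sPencils * (TILE_X + 2*RAD) floats regardless of mx, so the domain size is limited only by device memory.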

I'm sorry Ernest, but I have to leave this as an exercise for you; I don't have time to solve it for you. I didn't develop this approach; I only translated the Fortran posts that Greg Ruetsch wrote to C++ (

Thanks a lot for these clear tutorials!

When I run the FDM code on my GTX 970 with CUDA 8, there is almost no difference between 4x64 and 32x64 for the Y and Z derivatives. Does it mean that coalescing is not so important any more? :)