Hi all. I’m in the process of writing a three-dimensional Lattice Boltzmann code for a GPU. To do this, I had to tackle the problem of having a grid that is either one- or two-dimensional.

I elected to use one-dimensional arrays for storage and a one-dimensional grid. An advantage of this approach is that the user can choose dimensions of arbitrary size, and memory alignment won’t be an issue so long as LXxLYxLZ is a multiple of 16 (or the array is padded at the very end). A disadvantage of this approach is that I still need to know where the physical boundaries are, and so I need

int n = threadIdx.x + blockIdx.x * blockDim.x;
int nx = n % LX;
int ny = (n % (LX * LY)) / LX;
int nz = n / (LX * LY);

in each kernel.

These % operations are expensive, as is integer division, and I find myself wishing that CUDA simply supported 3D grids, so that I could have 1D arrays, 3D grids, and generate n from the three thread indices.

My question is, what other kinds of solutions have people come up with when dealing with this problem? Are there better ways to do it, or are there simply many ways, each with a tradeoff of some kind?

Incidentally, while looking into this problem, I came across this curious paper from an IEEE conference proceeding. In it, the author simply uses a three dimensional grid to solve a problem in a three dimensional domain without any mention of the fact that 3D grids aren’t supported. Any comments?

3D grids are supported in compute capability 2.x devices. On 1.x devices, where dimGrid.z has to be 1, you can use a simple loop inside the kernel instead.

Thanks. I’d read that 3D grids were technically possible in 2.#, but that the software was not yet in place to support it. From your comment, I assume the software is now in place.

I’d thought of the simple loop idea, but the problem with that is if a user (that isn’t familiar with GPUs) wants to simulate a problem that is small in LX and LY and very large in LZ, and chooses the z-direction to be perpendicular to the 2D grid, they’ll have performance issues.

Still, I suppose I could take that choice away from the user, and make sure that the grid is always assigned to the largest plane out of LXLY, LXLZ and LY*LZ.

Of course, there’s always the possibility that all 3 dimensions will be small enough that treating any single plane with a 2D grid and summing in the remaining direction would be inefficient, but processing all of the nodes with a 1d grid would be efficient.

It depends on your problem. But if there is no interaction with the local neighbors (i±1,j±1) the 2D and 3D arrays can be treated in the kernel as a 1D array as it is anyway stored in the memory. You can have a 3D grid of blocks in which the threads are also a 3D. So in practice you can play with 6 indeces. You are right you need to see for the special cases (quasi 2D, quasi 1D) how to choose the optimal way.

Well, one problem that could occur with my method (that I just realized) is that my grid may not be big enough. Suppose I want to do a single precision D2Q9 lattice boltzmann model - I would need 9 + 2 + 1 floating point numbers per lattice site. This means I can deal with a system containing 83 million lattice sites on my Tesla C1060 with 4GB.

With a one dimensional grid however, I can only have a maximum of 65535 * 512 = 33 million threads for Compute 1.3 and 67 million threads for Compute 2.0. This means I could not overlay a single grid on my physical domain.

This is a problem! Perhaps its time I considered 2 and 3 dimensional grids.