I have to get familiar with CUDA to work on a CFD problem (FDS). I have a basic understanding of memory allocation and the techniques behind CUDA, but a few questions are still unclear to me.

How do I define the grid size, the number of blocks, the number of threads, etc. for a problem that has no fixed "matrix" size? Do these values depend only on my hardware, and if so, how do I choose them? Is there a way to define them automatically or dynamically?

How do I handle unstructured meshes, where the cells are not regular?

Many thanks in advance. These are a few points that remained unclear after reading the guides and literature.

To keep it simple, choose the number of threads in a block to be a multiple of the warp size (currently 32). From this, choose the number of blocks so that the entire matrix is covered.

The best configuration depends a lot on the algorithm and the problem size; there is really no fixed standard configuration. I tend to start with a fixed thread-block size (say, 128 threads) and work from there.
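As a side note, newer CUDA toolkits (6.5 and later) can suggest a block size for you: `cudaOccupancyMaxPotentialBlockSize` returns the block size that maximizes occupancy for a given kernel. A minimal sketch (the `scale` kernel is just a placeholder):

```cuda
#include <cstdio>

// Placeholder kernel: scales a vector by a constant.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main()
{
    int n = 1 << 20;

    // Ask the runtime for a block size that maximizes occupancy
    // for this particular kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

    // Round up so every element is covered.
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("blockSize = %d, gridSize = %d\n", blockSize, gridSize);
    return 0;
}
```

The returned block size is a starting point for occupancy, not necessarily the fastest configuration for your kernel, so it is still worth benchmarking a few values around it.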

The grid size (and block size) should be chosen so that the hardware is fully exploited. You can launch only 2 blocks on a GTX 480 with its 15 streaming multiprocessors, but that will not take full advantage of the hardware.

Yes. You can write the code so that more blocks are launched for larger data sizes.

For unstructured meshes, launch more blocks/threads than required and discard the unused values. Alternatively, you can pad the input matrix with zeroes so that it fits a given size and work on that (this may not work in all cases).
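Concretely, the "cover the whole data set and discard the extras" pattern looks like this. This is only a sketch; the kernel, the array, and the sizes are placeholders, not anything from FDS:

```cuda
#include <cstdio>

// Placeholder kernel: one thread per cell, e.g. updating a field value.
__global__ void updateCells(float *field, int nCells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCells)          // extra threads in the last block do nothing
        return;
    field[i] += 1.0f;         // stand-in for the real per-cell update
}

int main()
{
    int nCells = 100000;       // comes from the mesh, not fixed at compile time
    int threadsPerBlock = 128; // a multiple of the warp size (32)

    // Round up so the whole data set is covered even when
    // nCells is not a multiple of the block size.
    int numBlocks = (nCells + threadsPerBlock - 1) / threadsPerBlock;

    float *d_field;
    cudaMalloc(&d_field, nCells * sizeof(float));
    cudaMemset(d_field, 0, nCells * sizeof(float));

    updateCells<<<numBlocks, threadsPerBlock>>>(d_field, nCells);
    cudaDeviceSynchronize();

    cudaFree(d_field);
    return 0;
}
```

The `if (i >= nCells) return;` guard is what makes a variable problem size safe: the grid is sized at launch time from the data, and the handful of surplus threads in the last block simply exit.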

To make it simple, choose the number of threads in a block to be a multiple of warpsize (32 currently). From this choose the number of blocks so that the entire matrix is covered.

I guess that in my fluid-dynamics case (fire simulation), the matrix size will depend on the geometry of the room, on events like heat sources, and on the different meshing of each problem. Or am I wrong?

Or do I define an algorithm like "number of blocks = detected matrix size / fixed number of threads (the maximum my hardware supports)"?

I am not sure I follow this. Surely if you have a mesh, you have a known number of nodes or integration points, and a known number of degrees of freedom per node. So the size of the resulting set of linear equations is fixed (the equivalent "matrix size"). You determine the execution grid in CUDA based on the total amount of data-parallel work, and the system size gives you the upper bound on that work.

That is a bit of a "how long is a piece of string" question, but graph coloring of the mesh is a good approach. If you use something like METIS to color the mesh during pre-processing, you will wind up with a set of independent color regions which can safely be solved in parallel without any ordering problems. In general, using coordinate and index arrays is better than using pointers, because of the difficulty of building and maintaining pointer trees in device memory. Parallel prefix-sum and reduction operations are very powerful ways to perform "assembly" of the partial results from multiple kernel calls for different mesh colors, or for sweeps/passes in a given coordinate direction.
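A rough sketch of the coloring idea, assuming the cells have been grouped by color into index arrays during pre-processing (all names and the layout are made up for illustration):

```cuda
// Cells are grouped by color: cellIdx[colorStart[c] .. colorStart[c+1]-1]
// lists the cells of color c. Cells of one color share no data with each
// other, so each color can be processed fully in parallel.
__global__ void relaxColor(const int *cellIdx, int first, int count, float *u)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count)
        return;
    int cell = cellIdx[first + t];
    u[cell] = 0.5f * u[cell];   // stand-in for the real per-cell relaxation
}

void sweepAllColors(const int *d_cellIdx, const int *h_colorStart,
                    int nColors, float *d_u)
{
    const int threads = 128;
    for (int c = 0; c < nColors; ++c) {
        int first = h_colorStart[c];
        int count = h_colorStart[c + 1] - first;
        int blocks = (count + threads - 1) / threads;
        relaxColor<<<blocks, threads>>>(d_cellIdx, first, count, d_u);
        // Kernel launches on the same stream execute in order, so color c
        // finishes before color c+1 starts; no explicit sync is needed.
    }
}
```

The index array `cellIdx` is exactly the "coordinate and index arrays instead of pointers" idea: the irregular connectivity lives in flat integer arrays that are cheap to build on the host and copy to the device.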

If you are looking for inspiration, Jonathan Cohen from NVIDIA Research has published some good material on CUDA for CFD and other PDEs, for example this paper. He also maintains a library of tools for solving convection-diffusion problems called OpenCurrent, which you might want to take a look at.
