Your experience with finite differences


I’m currently looking for feedback about programming finite-difference algorithms (efficiently :D) with CUDA.
My problem is in 3D, but it only involves 1D finite differences (yes, easier to implement!).

The strategy: every thread loads its own cell from global to shared memory, then uses shared memory to access its own data and its neighbours’.
The difficulty is obviously at the blocks’ boundaries, since blocks can’t communicate.

I have been thinking about two solutions:

  • duplicate the code to compute the new cell values at the boundaries
    => divergence in each half-warp, and the time cost of the divergent path is the same as the time cost of the main computation, as if each problem dimension were 2+ times bigger; then, may I say “2^3 = 8 times more cells in 3D”? It seemed very inefficient to me.

  • “overlap” blocks: each thread loads the value of its associated cell into shared memory, but only 14 out of 16 threads compute and write the result back (the neighbouring blocks will compute and store the missing values).
    => The cost: memory loads and stores are not perfectly coalesced (no alignment). On a 1.1 card this produces terribly poor performance, but on 1.3 it is not bad at all, since each load/store from/to gmem requires only 1 to 4 memory transactions.

(I don’t write the result of each computation directly over the input data, which minimizes the need for synchronisation.)
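A minimal sketch of the second (“overlap blocks”) strategy, assuming a 16-wide block, a 3-point stencil, and a simple central difference as a placeholder for the real computation (all names and coefficients here are illustrative, not from the post):

```cuda
#define BLOCK_X 16
#define RADIUS  1   // 3-point stencil: one halo cell on each side

// Each 16-thread block loads 16 cells into shared memory, but only the 14
// interior threads compute and write a result; neighbouring blocks overlap
// by 2*RADIUS cells so every interior cell is produced exactly once.
__global__ void fd1d_overlap(const float *in, float *out, int nx)
{
    __shared__ float s[BLOCK_X];

    // Blocks advance by (BLOCK_X - 2*RADIUS), hence the shifted (misaligned) loads.
    int gx = blockIdx.x * (BLOCK_X - 2 * RADIUS) + threadIdx.x - RADIUS;

    // Every thread loads its own cell.
    s[threadIdx.x] = (gx >= 0 && gx < nx) ? in[gx] : 0.0f;
    __syncthreads();

    // Only the 14 interior threads compute and store.
    if (threadIdx.x >= RADIUS && threadIdx.x < BLOCK_X - RADIUS &&
        gx >= RADIUS && gx < nx - RADIUS) {
        out[gx] = 0.5f * (s[threadIdx.x + 1] - s[threadIdx.x - 1]); // central difference
    }
}
```

The shifted load `gx = ... - RADIUS` is exactly where the misalignment cost discussed above comes from on 1.x hardware.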

So, assuming I only use 1.3/2.0 cards, I am very interested in your experience. What do you think of my approach, and what would you do, or have you done, in similar problems?

Thank you!

I guess it will be hard to beat Paulius Micikevicius’ articles… :)…pu_3dfd_rev.pdf

Hello, and thank you :)

I’ve read this article again and again, but! Here, the branch divergence for loading the halos comes at more or less no cost.

Unfortunately, in my problem each value has to be computed (not just loaded) using values from other arrays, and the computation is quite heavy… so the divergence cost becomes non-negligible!

(My memory requirements don’t allow me to store intermediate data, because there would be dozens of intermediate arrays if I wanted to use Micikevicius’ scheme! And the problem size is huge: 7 equally sized data arrays + 5 ‘temporary’ arrays for intermediate output. We’re far from the two arrays of the article, not to mention the shared memory/register needs.)

Obviously, duplicating code is the best solution when threads just have to load values, but what if the halos need heavy computations?

My question was not well posed at all, and I concede that my problem is very specific. Anyway, if anybody has suggestions…!

Thank you anyway, this article is very instructive!

In general, overlapping blocks and launching additional threads to deal with the loading of block level boundary data into shared memory is preferable to any other way to do it. It doesn’t cause much, if any, warp divergence and it doesn’t have a large effect on the overall numerical throughput. The hardware can manage the additional threads with almost no performance penalty.

I usually use the method-of-lines approach (i.e. treat the spatial and temporal discretisations independently). For non-linear implicit schemes (I presume this is what you are asking about?), I prefer to eliminate the true boundary ghost cells from the formulation and then just have each 2D block “walk” through the third dimension of the grid to compute the function values for each node/cell.

Thank you for your answer!

Do you have recommendations for maximizing coalesced accesses with overlapping blocks?

Finite differences along the Y and Z axes don’t produce uncoalesced accesses, but along the X axis it’s a big problem for me.

According to the 3dfd SDK code, there seem to be problems with memory coalescing when filling in the “in-front” and “behind” data along the z-direction when the stride is not a multiple of 32. However, I suspect this won’t be too much of an issue on 2.0, given the 16/48 KB of cache memory. In fact, in a simple kernel that looks at the performance hit of an offset array assignment (which is similar to what happens in 3dfd), I detected negligible performance degradation. Here is a thread that discusses this.


I suppose one can change the dimension sizes of the finite-difference grid to a multiple of 32 to see whether or not memory coalescing is much of an issue here.
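The offset-assignment experiment mentioned above can be reproduced with a trivial kernel along these lines (the kernel and its name are my own illustration, not code from the SDK):

```cuda
// Copy with a configurable offset, to measure the penalty of misaligned
// accesses on a given card. offset = 0 gives fully coalesced transfers;
// other values shift every warp's accesses off the alignment boundary.
__global__ void offset_copy(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset; // shift breaks alignment
    if (i < n)
        out[i] = in[i];
}
```

Timing this kernel for offset = 0..31 (e.g. with cudaEvent timers) shows how much a non-multiple-of-32 stride costs on a particular architecture.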

If I read the original question correctly, you’re dealing with 2nd order in space (a 3-point stencil, since you mention reading 16 points but writing 14)? You also say that your FD is in one dimension, but which of the 3 dimensions is it (fastest-varying, 2nd fastest-varying, or slowest-varying)?

If the stencil is along the fastest-varying dimension, then stage data through shared memory. Have all threads read the elements at indices corresponding to the output, then have a few of those threads also read the halos. This is shown in our CUDA Basics slides used in the webinars (CUDA C - The Introduction); a slightly older version of the slides is in the following thread:
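That staging pattern looks roughly like this (RADIUS and the central-difference formula are illustrative placeholders for the real stencil):

```cuda
#define RADIUS 1   // 3-point stencil

// All threads perform one aligned, coalesced load of "their" element; the
// first RADIUS threads additionally fetch the left and right halos. The
// halo branch diverges, but it is cheap because it only loads data.
__global__ void fd1d_halo(const float *in, float *out, int nx)
{
    extern __shared__ float s[];             // blockDim.x + 2*RADIUS floats
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int lx = threadIdx.x + RADIUS;

    if (gx < nx)
        s[lx] = in[gx];                      // main load: coalesced

    if (threadIdx.x < RADIUS) {              // a few threads load the halos
        s[lx - RADIUS]     = (gx >= RADIUS)            ? in[gx - RADIUS]     : 0.0f;
        s[lx + blockDim.x] = (gx + blockDim.x < nx)    ? in[gx + blockDim.x] : 0.0f;
    }
    __syncthreads();

    if (gx >= RADIUS && gx < nx - RADIUS)
        out[gx] = 0.5f * (s[lx + 1] - s[lx - 1]);   // placeholder stencil
}
```

Unlike the overlapping-blocks variant, every thread here writes a result, so stores stay fully coalesced; only the small halo loads are irregular.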

If the stencil is not along the fastest-varying dimension, then it’s even simpler - march along that dimension, keeping all the needed values in registers.
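A sketch of that register-marching pattern, for a stencil along the slowest-varying (z) dimension, with a central difference standing in for the real operator:

```cuda
// A 2D tile of threads walks along z, keeping the behind/current/in-front
// values in registers, so the z-stencil needs no shared memory and all
// global loads/stores stay coalesced in x. Assumes nz >= 3.
__global__ void fd1d_march_z(const float *in, float *out,
                             int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    size_t stride = (size_t)nx * ny;         // one z-plane
    const float *p = in + (size_t)y * nx + x;

    float behind  = p[0];                    // z = 0
    float current = p[stride];               // z = 1

    for (int z = 1; z < nz - 1; ++z) {
        float infront = p[(size_t)(z + 1) * stride];
        out[(size_t)z * stride + (size_t)y * nx + x] =
            0.5f * (infront - behind);       // central difference in z
        behind  = current;                   // shift the register pipeline
        current = infront;
    }
}
```

Each input plane is read exactly once; only the three-deep register pipeline persists across iterations.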

I don’t think having 7 input arrays is going to be a problem. Do you apply the stencil to all 7? Even if you do, 3-point stencil is small. For the 2 slowest-varying dimensions that’s 21 registers of state for fp32, 42 registers for fp64 (plus registers for indexing, pointers, etc.). For the fastest-varying dimension that’s 7*(number_of_threads_in_threadblock+2)*sizeof(data_type) bytes of shared memory. Say you use 256-thread blocks and fp32, in which case you end up with ~7KB of smem per threadblock. For fp64 that grows to ~14KB. You should be able to hide latency reasonably well with such smem usage on GT200, on Fermi you could configure for 48KB of smem and run several blocks per SM without a problem.

Thank you very much for your answer and your time!

Each cell needs the values of its 2 neighbours. All my FDs are one-dimensional, but along each dimension in turn (yes, I guess I wasn’t really clear!). So I need 3 kernels, one per dimension. (It’s a Lagrange-remap scheme: each dimension has to be handled separately, because there is a dimension-specific projection step after the Lagrangian step.)

Summary:

Lagrangian X => 1D FD along X

Remap X => 1D FD along X

Lagrangian Y => 1D FD along Y

Remap Y => 1D FD along Y

Lagrangian Z => 1D FD along Z

Remap Z => 1D FD along Z

What I’m actually doing is to put everything into shared memory for each dimension. The point is that each cell has to:

  • compute a value from its 2 neighbours’ values

  • share it with its neighbours

  • recompute a value using the new values

  • share it again

  • compute once more and write back to gmem

(each new computation needs the previous values!)
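The compute/share/recompute sequence above could be skeletonized like this (the per-round formulas are placeholders for the real physics; note that each round effectively consumes one cell of halo, so with overlapping blocks the overlap must be as wide as the number of rounds):

```cuda
// Three dependent rounds of neighbour-based computation inside one block.
// A barrier separates reading the old values from publishing the new ones,
// and the final result goes straight to a separate output array so the
// input is never overwritten while neighbours still need it.
__global__ void fd1d_multiround(const float *in, float *out, int nx)
{
    extern __shared__ float s[];             // blockDim.x floats
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int t  = threadIdx.x;
    bool interior = (t > 0 && t < blockDim.x - 1);

    s[t] = (gx < nx) ? in[gx] : 0.0f;
    __syncthreads();

    // Round 1: compute from neighbours, then share the result.
    float v1 = interior ? 0.5f * (s[t + 1] - s[t - 1]) : s[t];
    __syncthreads();            // everyone is done reading the old values
    s[t] = v1;
    __syncthreads();            // new values visible to neighbours

    // Round 2: recompute from the round-1 values.
    float v2 = interior ? 0.5f * (s[t + 1] - s[t - 1]) : s[t];
    __syncthreads();
    s[t] = v2;
    __syncthreads();

    // Round 3: final computation, written back to global memory.
    if (interior && gx < nx)
        out[gx] = 0.5f * (s[t + 1] - s[t - 1]);
}
```

The double `__syncthreads()` per round (one before overwriting smem, one after) is what prevents a fast thread from clobbering a value a neighbour has not yet read.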

I think there is no way to do this efficiently by marching along a dimension, am I wrong? (And without intermediate arrays!)

Having 7 arrays is indeed not a latency/occupancy problem (I achieved 100% occupancy in all my kernels (16 regs/thread + 3 KB smem/block and 0 local loads/stores) without big trouble, with very aggressive register and smem reuse). It is a problem of intermediate data: the algorithm involved is really complex because of the intermediate data that I can’t store if I want to be efficient, and that involves multiple shares in smem.

If you want to see it for yourself, the algorithm is described here: ; I implemented it in 3D.

To summarize a little bit:

Lagrangian step:

Updating the density value = computing the velocity at the cells’ sides, which needs: an FD of velocity at the centers, an FD of pressure at the centers, an FD of sound speeds (themselves computed from another FD of a state variable to tabulate the law, and an FD of internal energy, which involves the velocities along the 3 axes and the total energy of the cell), plus an FD of densities.

Updating velocity = needs pressure at the sides, which needs the FD of velocity at the sides (so, as previously) + an FD of pressure (and pressure has to be computed from other variables!).

Updating total energy: same data.

The projection step is far more complex (it needs the density over 3 time steps to compute the intermediate flux, and the velocities at the sides over 2 time steps for intermediate data too).

Here I have 7 arrays: mass fraction, density, 3 velocities, total energy, and another state variable for the law, everything at the centers of the cells. I can’t even afford to store the values at the sides!

I first wanted feedback on the “overlapping blocks vs. handling boundaries separately” question (especially the impact of misaligned accesses), not specifically on my problem.

Again, the constraints are: huge memory needs and lots of intermediate data that involve FDs too!

So I can’t figure out how to apply the 3dfd sample’s scheme efficiently with intermediate data involving FDs and without gmem; I would say that the chosen equation is just “very well suited for GPUs”, am I wrong?

Actually I achieved about a 75x speedup (with the PCI Express bottleneck, from a dual-Xeon 3 GHz to a Tesla C1060), so I’m really satisfied! I’m looking for general considerations about problems involved in FD schemes, like block synchronization and memory usage (you can’t write back if the neighbouring blocks need the old value). I implemented some weird tricks to solve this problem while minimizing memory usage, but I’m puzzled that there is nothing on the web about complex FD implementations in CUDA!

Does the above describe one step or 3? Or is that computation increasing the order in space? Could you roll the three computations into a single one that takes the values of 6 neighbours as input? If you post pseudo-code or the equations that govern this computation, it might be easier to understand your case.

[post with images deleted - needed to clean my personal ftp site]