I know what this error message means, but I don’t know where the error comes from. I also know that the limits for dimBlock x, y, z are 512, 512, 64, and my program doesn’t hit them. When I set dimGrid.z = 1 the kernel call works. So where is my problem? Am I blind?
Only NVIDIA knows for sure, but I am guessing their scheduling hardware/software uses a single 32 bit word for the block ID, i.e. two 16 bit fields, which gives the 65535 x 65535 limit. It also makes some sense if you bear in mind the original purpose of all the current GPUs, i.e. rendering, which basically resolves down to a 2D grid of pixels for display with underlying 3D calculations for the values of each pixel. So 3D blocks with 2D grids isn’t such a surprising choice.
Of course this might all change with the next generation Fermi devices, but that is only speculation…
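For what it is worth, the standard workaround is to keep the launch 2D and recover the third index inside the kernel, either by looping over z within each block or by folding z into grid.y. A minimal sketch of the folding version (the kernel body and array layout are just placeholders, not anyone’s actual code):

    __global__ void kernel3d(float *data, int nx, int ny, int nz)
    {
        int blocksY = (ny + blockDim.y - 1) / blockDim.y;
        int z  = blockIdx.y / blocksY;      // slice index recovered from grid.y
        int yb = blockIdx.y % blocksY;      // y tile within that slice
        int x  = blockIdx.x * blockDim.x + threadIdx.x;
        int y  = yb * blockDim.y + threadIdx.y;

        if (x < nx && y < ny && z < nz)
            data[((size_t)z * ny + y) * nx + x] += 1.0f;   // placeholder work
    }

    // Host side: z is folded into grid.y, grid.z stays at 1.
    // Note (by * nz) must still be below 65536.
    // dim3 block(16, 16);
    // int bx = (nx + 15) / 16, by = (ny + 15) / 16;
    // dim3 grid(bx, by * nz);
    // kernel3d<<<grid, block>>>(d_data, nx, ny, nz);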
I think it will depend a lot on your global memory access patterns.
I have a number of different 3D finite difference discretisation kernels for convection-diffusion equations which require 3 or 5 point stencil data in each spatial direction. For memory coalescing reasons, I wound up using a 2D block and grid structure, with each thread traversing the third spatial dimension and maintaining a local shared memory cache of the data points necessary for constructing stencils in each block. In that arrangement, some threads per block don’t participate other than to do memory reads, and for branch divergence reasons, some do redundant calculations which get discarded, but it is still faster overall because it guarantees all global memory reads and writes are coalesced and keeps divergence and shared memory bank conflicts to a minimum (in the case of the double precision versions of the code).
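Not my actual code, but roughly the shape of it, for a simple 7 point stencil. The tile size, boundary treatment (zero-filled halo) and Laplacian coefficients are all placeholder choices; launch with dim3 block(TILE_X + 2, TILE_Y + 2) and a grid of nx/TILE_X by ny/TILE_Y blocks, assuming divisibility:

    #define TILE_X 16
    #define TILE_Y 16

    // The outer ring of the 18x18 block only loads the x-y halo and never
    // writes a result. Each thread marches along z, keeping its z-1 and z+1
    // values in registers and the current plane (with halo) in shared memory.
    __global__ void stencil3d(const float *in, float *out,
                              int nx, int ny, int nz)
    {
        __shared__ float plane[TILE_Y + 2][TILE_X + 2];

        const float c0 = -6.0f, c1 = 1.0f;   // placeholder Laplacian weights

        int x = (int)(blockIdx.x * TILE_X + threadIdx.x) - 1;
        int y = (int)(blockIdx.y * TILE_Y + threadIdx.y) - 1;
        bool inside = (x >= 0) && (x < nx) && (y >= 0) && (y < ny);
        bool writer = inside &&
                      threadIdx.x >= 1 && threadIdx.x <= TILE_X &&
                      threadIdx.y >= 1 && threadIdx.y <= TILE_Y;

        float below = 0.0f, centre = 0.0f, above;
        if (inside) {
            below  = in[((size_t)0 * ny + y) * nx + x];
            centre = in[((size_t)1 * ny + y) * nx + x];
        }

        for (int z = 1; z < nz - 1; ++z) {
            above = inside ? in[((size_t)(z + 1) * ny + y) * nx + x] : 0.0f;

            // every thread, halo ring included, refreshes the shared cache
            plane[threadIdx.y][threadIdx.x] = inside ? centre : 0.0f;
            __syncthreads();       // tile must be complete before anyone reads

            if (writer)
                out[((size_t)z * ny + y) * nx + x] =
                    c0 * centre +
                    c1 * (plane[threadIdx.y][threadIdx.x - 1] +
                          plane[threadIdx.y][threadIdx.x + 1] +
                          plane[threadIdx.y - 1][threadIdx.x] +
                          plane[threadIdx.y + 1][threadIdx.x] +
                          below + above);
            __syncthreads();       // all reads done before the tile is refilled

            below  = centre;       // slide the z window forward
            centre = above;
        }
    }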
If your data usage patterns are much closer to 1 global memory read per thread, or you are using textures, it probably doesn’t matter that much. The key things are a multiple of 32 threads per block, for scheduling and memory coalescing reasons, and at least 192 threads per multiprocessor to amortise instruction pipelining costs.
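For instance (launch parameters only, with a placeholder kernel called mykernel):

    // 16x16 = 256 threads per block: a multiple of 32, and with two or
    // three blocks resident per multiprocessor you are comfortably past
    // the ~192 threads per SM needed to cover the instruction pipeline.
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    mykernel<<<grid, block>>>(d_in, d_out, nx, ny, nz);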
Yes, two. At each iteration, all threads in a block do their global reads and shared memory updates and then synchronize. After that, the calculating threads do their work, then the participating threads do their writes, while the idle threads sit at a second thread barrier. Then the loop continues.
If you are referring to the Micikevicius paper, then (if my memory serves me correctly) they didn’t use the same strategy as what I am describing. In that implementation, they did the differencing in two passes with two separate kernels: first a 2D differencing (x-y) pass, then a 1D z pass on the results of the 2D differencing. They can do that because the high order central scheme they use can be decomposed into a set of partial sums. The difference operators for the hyperbolic terms I use can’t easily be done that way, so I do the differencing calculations in a single pass, but using a 2D kernel grid.
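Roughly, the trick (my notation, not theirs) is that a central difference operator is a sum of three independent 1D sums:

    D u = Dxx u + Dyy u + Dzz u,   where   Dxx u(x,y,z) = sum over i = -k..k of c_i * u(x+i, y, z)

and similarly for Dyy and Dzz. So the first kernel can accumulate (Dxx + Dyy) u into a temporary array, and the second kernel just adds Dzz u to it.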
I have done some very similar things recently, though I was using upwind differencing and I think I may have structured it in a slightly different way. What sort of memory bandwidth/FLOPS were you getting?
It has been a while since I did any profiling and I haven’t really done all that much in the way of serious optimization, but I seem to recall that the explicit version of the 3D differencing kernel (i.e. one using ghost cells) was hitting about 95 GB/s in single precision when I did look at it in the profiler running on a stock GTX 275. I think the theoretical bandwidth of that card is around 127 GB/s, so there should still be more to be had out of it, but I haven’t been using the explicit code all that much in the last couple of months.
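In case it is useful to anyone reading along: that sort of number is the usual effective-bandwidth estimate, i.e. total bytes read plus written divided by kernel time. A sketch of the measurement, with the launch itself left as a placeholder:

    #include <cuda_runtime.h>

    // time a kernel launch and convert bytes moved into GB/s
    double effective_gbps(double bytes_moved /* reads + writes */)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // mykernel<<<grid, block>>>(...);   // placeholder launch
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        return bytes_moved / (ms * 1.0e6);   // bytes per ms -> GB/s
    }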