Accelerating a Gridder

I’m attempting to accelerate a gridding algorithm in CUDA, and I’m hitting some serious problems getting the code to run faster on the GPU.

I have a list of sources (around 130k) which have to be placed on a 1500x1500 grid. Each source affects a 5x5 (or 7x7… this is a runtime decision for all sources) patch of the larger grid. The grid is of complex numbers (and there are actually four of them, but that doesn’t affect the algorithm; we just have to do the same thing once for each grid).

The CPU-based code loops over all sources, adding the effect of each one to the grid cells it covers. This takes about 500 ms. However, since each pixel in the large grid can potentially be affected by more than one source, it's not safe to give each GPU thread a source (I've verified this with a test kernel - quite a few source contributions get lost to race conditions).
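In pseudocode terms, the CPU loop is roughly the following (a simplified sketch with made-up names, not the actual code; the constant weight stands in for the real convolution kernel):

// CPU gridder sketch: scatter each source onto the small patch of grid cells it covers.
// gridSize, support (5 or 7) and the weight are placeholders.
#include <complex>
#include <vector>

void gridSources(const std::vector<float>& srcU, const std::vector<float>& srcV,
                 const std::vector<std::complex<float> >& srcVal,
                 std::vector<std::complex<float> >& grid,
                 int gridSize, int support, float pixelSize)
{
    const int half = support / 2;
    for (size_t s = 0; s < srcVal.size(); ++s) {
        // Nearest grid cell to this source
        int cu = (int)(srcU[s] / pixelSize) + gridSize / 2;
        int cv = (int)(srcV[s] / pixelSize) + gridSize / 2;
        for (int du = -half; du <= half; ++du) {
            for (int dv = -half; dv <= half; ++dv) {
                int iu = cu + du, iv = cv + dv;
                if (iu < 0 || iu >= gridSize || iv < 0 || iv >= gridSize) continue;
                float w = 1.0f;  // placeholder for the spheroidal convolution weight
                grid[iv + iu * gridSize] += w * srcVal[s];
            }
        }
    }
}

(In the real code this is done once per grid, for the four grids.)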

So, in the CUDA code, I give each cell in the 1500x1500 grid to a thread, and let each thread loop over sources. This is thread-safe but hideously slow (as in around 20x slower than the CPU implementation), since most grid cells only have one or two sources in range (and quite frequently zero, if you look at the numbers). Right now I'm trying to use texture arrays to speed up the source lookups, but I'm not certain this is going to result in any substantial speed gains, at least compared with the CPU implementation.

Someone else had worked on an earlier version of the code, and had used OpenGL in this portion. Apparently the pixel pipelines there can do atomic adds on floats in the process of compositing an image. Unfortunately, I don’t know OpenGL.

Has anyone else come across a similar problem, and if so, how did you solve it within CUDA? Thanks in advance…

My app needs to grid ~64k points into a 15x15x15 grid, and not only do I need to count the number of points in each cell, I also need to build a list of the particles in each cell.

My solution is still to grid on the CPU :( Even with the cost of copying the points from the GPU and copying the grid back to the GPU it is faster than anything I’ve been able to come up with on the GPU.
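For reference, the CPU binning amounts to something like this (a counting-sort style sketch with made-up names, not my actual code):

// CPU binning sketch: count the particles in each cell and build per-cell index lists.
#include <vector>

void binParticles(const std::vector<int>& cellOfParticle,  // precomputed cell index per particle
                  int numCells,
                  std::vector<int>& cellStart,   // offset of each cell's first entry in 'sorted'
                  std::vector<int>& cellCount,   // number of particles in each cell
                  std::vector<int>& sorted)      // particle indices, grouped by cell
{
    cellCount.assign(numCells, 0);
    for (size_t p = 0; p < cellOfParticle.size(); ++p)
        ++cellCount[cellOfParticle[p]];

    cellStart.assign(numCells, 0);
    for (int c = 1; c < numCells; ++c)
        cellStart[c] = cellStart[c - 1] + cellCount[c - 1];

    sorted.resize(cellOfParticle.size());
    std::vector<int> fill(cellStart);            // running insertion position per cell
    for (int p = 0; p < (int)cellOfParticle.size(); ++p)
        sorted[fill[cellOfParticle[p]]++] = p;
}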

I can offer some advice, though. Maybe with the differences between our situations you’ll at least be able to get some amount of speedup over the CPU.

So, in the CUDA code, I give each cell in the 1500x1500 grid to a thread, and let each thread loop over sources. This is thread-safe but hideously slow

Do you have each thread read the sources from global memory? A texture won't really help you here. What you need is shared memory. Have each thread load one source into shared memory (fully coalesced), syncthreads, and then every thread in the block can loop over all the sources in shared memory, using the broadcast mechanism. After you finish that loop, syncthreads again, slide the window over to the next set of sources, load them into shared memory, and do it all again… (I can post code if you want).

This is still O(NM), while the CPU algorithm is only O(N), but maybe it will turn the 20x speed loss into a 2x speedup.
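Roughly, the loop structure I have in mind looks like this (just a sketch with made-up names; TILE, the in-range test and the accumulation are placeholders, and only one grid/polarisation is shown):

#define TILE 256

// Each thread owns one grid cell; the block streams the source list through
// shared memory in tiles, so the global loads are coalesced and the per-source
// reads inside the inner loop come from shared memory via broadcast.
__global__ void gridGatherShared(const float *srcU, const float *srcV, const float2 *srcVal,
                                 int nSrc, int gridSize, float pixelSize, float kernelExtent,
                                 float2 *grid)
{
    __shared__ float  sU[TILE], sV[TILE];
    __shared__ float2 sVal[TILE];

    int iv = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's grid cell
    int iu = blockIdx.y * blockDim.y + threadIdx.y;
    float myU = (iu - gridSize / 2) * pixelSize;
    float myV = (iv - gridSize / 2) * pixelSize;
    float2 acc = make_float2(0.0f, 0.0f);

    int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    int nThreads = blockDim.x * blockDim.y;

    for (int base = 0; base < nSrc; base += TILE) {
        // Cooperative, coalesced load of one tile of the source list
        for (int i = tid; i < TILE && base + i < nSrc; i += nThreads) {
            sU[i]   = srcU[base + i];
            sV[i]   = srcV[base + i];
            sVal[i] = srcVal[base + i];
        }
        __syncthreads();

        int n = min(TILE, nSrc - base);
        for (int s = 0; s < n; ++s) {                 // all threads read the same element: broadcast
            if (fabsf(sU[s] - myU) <= kernelExtent && fabsf(sV[s] - myV) <= kernelExtent) {
                acc.x += sVal[s].x;                   // placeholder: apply the convolution weight here
                acc.y += sVal[s].y;
            }
        }
        __syncthreads();
    }

    if (iu < gridSize && iv < gridSize)
        grid[iv + iu * gridSize] = acc;
}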

I tried shared memory too… the performance was not especially impressive. Right now I'm just trying to figure out why the writes to global memory aren't being coalesced.

__global__ void ConvolveKernelSimple( const cuFloatComplex *visibilities,
                                      const float *uVals, const float *vVals,
                                      const int nVis,
                                      const float pixelSize,
                                      const int kernelSize,
                                      const int uSize, const int vSize,
                                      cuFloatComplex *uvGrid ) {
  // Kernel to take a list of visibilities, and grid them
  // using the spheroid convolution kernel

  cuFloatComplex myVis[kDevicePols];

  // Get cell indices
  int iv = threadIdx.x + (blockDim.x*blockIdx.x);
  int iu = threadIdx.y + (blockDim.y*blockIdx.y);

  // Compute location of pixel centre
  float myu = (iu - (uSize/2))*pixelSize;
  float myv = (iv - (vSize/2))*pixelSize;

  // Extent of kernel
  float kernelExtent = ((float)kernelSize*pixelSize) / 2;

  // Zero accumulator
  for( int iPol=0; iPol<kDevicePols; iPol++ ) {
    myVis[iPol] = make_cuFloatComplex( 0, 0 );
  }

  [...]

  // Store if in range
  if( iu < uSize && iv < vSize ) {
    for( int iPol=0; iPol<kDevicePols; iPol++ ) {
      uvGrid[iv + (iu*vSize) + (iPol*uSize*vSize)] = myVis[iPol];
    }
  }
}

According to cudaprof, none of the writes in the final section are coalesced, which is making me scratch my head right now. I have another, similar routine that operates on a float array, and it does manage to coalesce a lot of its writes :(

An update on this… using shared memory actually slows things down. As written, each thread checks whether each source is in range, and skips it if not. This saves a load of 4 complex floats from global memory if none of the threads in the block want the source. If the source list is broken down and read into shared memory, then this is lost, since we don’t know in advance if one of the threads in the block is going to need the source.

Just brainstorming here, but perhaps you could experiment with a divide-and-conquer approach by doing coarse-to-fine binning in multiple passes.

Something like taking one pass through the list to make 100 new sublists, one for each 150x150 subsquare of your image. You could do this recursively, breaking into even smaller partitions with additional passes as needed. Then follow up with a second kernel that applies the effects from a small list to just a (say, 15x15) subsquare.

You’d have to experiment with sizes and such, and probably your second kernel would be designed to fit the small subgrid into shared memory.
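A sketch of what the first, coarse pass might look like (purely illustrative; assumes integer atomics are available, and the bin lookup and list sizes are made up):

// Coarse binning pass: one thread per source, scattering source indices into
// per-subsquare lists, with an atomic counter per bin (needs compute capability >= 1.1).
__global__ void binSources(const float *srcU, const float *srcV, int nSrc,
                           float pixelSize, int gridSize, int binSize,
                           int *binCount, int *binList, int maxPerBin)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSrc) return;

    // Which subsquare does this source fall into?
    int iu = (int)(srcU[s] / pixelSize) + gridSize / 2;
    int iv = (int)(srcV[s] / pixelSize) + gridSize / 2;
    int binsPerSide = gridSize / binSize;
    int bu = min(max(iu / binSize, 0), binsPerSide - 1);
    int bv = min(max(iv / binSize, 0), binsPerSide - 1);
    int bin = bv + bu * binsPerSide;

    int slot = atomicAdd(&binCount[bin], 1);    // reserve a slot in this bin's list
    if (slot < maxPerBin)
        binList[bin * maxPerBin + slot] = s;    // record the source index
}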

It's really hard to tell whether it'd be a speedup without trying, though. Your 130K samples are a relatively short list compared to the grid resolution, so the dumb CPU approach may win overall.

I’ve been working on something like that today, putting the sources into 10x10 pixel bins, and using those to cut down the number of sources each pixel iterates over. Results have not been encouraging so far, unfortunately. What’s really frustrating is the existence of an OpenGL implementation which is (apparently) substantially faster, taking advantage of glBlendFunc().

OK, more brainstorming, based on the idea of a short sample list and a large grid.

How about sorting the source list by 1D grid index, but modifying the sort so that if two values land on the same coordinate, you flag the second (and later) occurrences as "dupes"? That flag could be a separate list or just a bit set. The point of sorting is mostly to find duplicate values.

Then process the list. It's sorted, so all the samples that affect a given output row of your grid are already grouped together, which makes them easy to deal with. The dupe tags tell you where a collision could occur, so in the first population pass you can simply skip the dupes and not worry about write contention. If dupes do exist, you can go back and deal with them afterwards, even slowly: if dupes are rare, use atomics; if dupes are common, do a reduction over the samples that all go to the same coordinate.

This algorithm scales much more nicely as the grid size goes up or the sample list shrinks.
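The sort-and-flag step on the host might look something like this (my sketch of the idea, with made-up names; untested):

// Sort source indices by flattened grid coordinate, then flag duplicates so a
// first GPU pass can skip them and write without contention.
#include <algorithm>
#include <vector>

void sortAndFlagDupes(const std::vector<int>& cellIndex,   // 1D grid index per source
                      std::vector<int>& order,             // source indices, sorted by cell
                      std::vector<char>& isDupe)           // 1 for second-and-later hits on a cell
{
    const int n = (int)cellIndex.size();
    order.resize(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cellIndex[a] < cellIndex[b]; });

    isDupe.assign(n, 0);
    for (int i = 1; i < n; ++i)
        if (cellIndex[order[i]] == cellIndex[order[i - 1]])
            isDupe[i] = 1;                                  // same cell as the previous entry
}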

I have no clue what a gridding algo is. But here is my 2 cents. HTH.

If you are using 2D blocks (i.e. the threads inside a block have two dimensions), then you may want to try switching the "y" and "x" dimensions of a block. For example, if it was 32x16, try 16x32…

The warp scheduling for 2D blocks is not documented fully AFAIK.

Also, when you flip the block dimensions you need to re-calculate the grid dimensions as well, if your grid dimensions depend on the block dimensions…
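For example (illustrative launch configuration only; uSize and vSize as in the kernel posted earlier):

// Flipping the block shape; the grid dimensions are recomputed from the new block.
dim3 block(32, 16);                            // original shape
// dim3 block(16, 32);                         // flipped variant to try
dim3 grid((vSize + block.x - 1) / block.x,     // blocks along x (iv)
          (uSize + block.y - 1) / block.y);    // blocks along y (iu)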

I cracked it in the end.

The trick was to divide the grid up into bins of (say) 10x10 pixels. Each source could then be quickly assigned to a bin. I then sorted the sources by the bin they were in, and used this to make a list of the first and last source in each bin. Right now, this is all done on the CPU (I can't use CUDPP's sort routine, since I have to sort multiple arrays by a key array). The GPU is given the sorted list of sources and the lists of first and last sources for each bin. Each thread (still working on a single pixel of the final grid) can then use the bin lists to pare down the number of sources which need to be examined. And since all the threads in a given block are working on adjacent pixels, they usually want the same sources. This is now 800x faster than the original GPU implementation, and 10x faster than the CPU gridder.
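The CPU-side preparation is roughly of this shape (a simplified sketch with made-up names and a single visibility array, not the actual code):

// Host-side preparation sketch: assign each source to a bin, sort all of the
// source arrays by bin, and record the first/last source index of each bin.
#include <algorithm>
#include <complex>
#include <vector>

// Bin index for a source at (u,v); no clamping of out-of-range sources.
static int binOf(float u, float v, float pixelSize, int gridSize, int binSize)
{
    int iu = (int)(u / pixelSize) + gridSize / 2;
    int iv = (int)(v / pixelSize) + gridSize / 2;
    int binsPerSide = (gridSize + binSize - 1) / binSize;
    return (iv / binSize) + (iu / binSize) * binsPerSide;
}

void prepareBins(std::vector<float>& u, std::vector<float>& v,
                 std::vector<std::complex<float> >& vis,   // one array per grid in reality
                 float pixelSize, int gridSize, int binSize,
                 std::vector<int>& binFirst, std::vector<int>& binLast)
{
    const int n = (int)u.size();
    const int binsPerSide = (gridSize + binSize - 1) / binSize;
    const int nBins = binsPerSide * binsPerSide;

    // Sort an index array by bin, then permute every source array the same way
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return binOf(u[a], v[a], pixelSize, gridSize, binSize)
             < binOf(u[b], v[b], pixelSize, gridSize, binSize);
    });
    std::vector<float> u2(n), v2(n);
    std::vector<std::complex<float> > vis2(n);
    for (int i = 0; i < n; ++i) { u2[i] = u[order[i]]; v2[i] = v[order[i]]; vis2[i] = vis[order[i]]; }
    u.swap(u2); v.swap(v2); vis.swap(vis2);

    // First and last (exclusive) source index of each bin; empty bins stay [0,0)
    binFirst.assign(nBins, 0);
    binLast.assign(nBins, 0);
    for (int i = 0; i < n; ++i) {
        int b = binOf(u[i], v[i], pixelSize, gridSize, binSize);
        if (i == 0 || b != binOf(u[i-1], v[i-1], pixelSize, gridSize, binSize)) binFirst[b] = i;
        binLast[b] = i + 1;
    }
}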

The trick of exchanging the thread assignment order for the arrays didn’t seem to help much here, but it has in other routines, where it suddenly enabled a bunch of memory accesses to be coalesced.

How do you deal with a source that affects pixels in multiple bins? E.g., a source at 9,9 affects not just pixels in its 10x10 bin (0-9,0-9), but those in three neighboring bins: (10-19,0-9), (0-9,10-19), and (10-19,10-19).

Brilliant! (why didn't I think of that?)

Since my code already sorts particles by bin to improve cache coherency, this should be trivial to add to my binning algorithm. If only I didn't have to prepare for NVISION and more travel after that, I'd try it right now.

Thanks for sharing your solution.

You know that each pixel can only be affected by the sources in its own bin, and those in the 8 bins surrounding it. You loop over those.
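As a sketch (same binFirst/binLast layout as the host sketch above; one grid only, and the in-range test and weighting are placeholders):

// Per-pixel gather: the thread for grid cell (iu,iv) only examines the sources
// in its own bin and the 8 neighbouring bins, using the per-bin ranges built on the CPU.
__global__ void gridFromBins(const float *srcU, const float *srcV, const float2 *srcVal,
                             const int *binFirst, const int *binLast,
                             int gridSize, int binSize, float pixelSize, float kernelExtent,
                             float2 *grid)
{
    int iv = blockIdx.x * blockDim.x + threadIdx.x;
    int iu = blockIdx.y * blockDim.y + threadIdx.y;
    if (iu >= gridSize || iv >= gridSize) return;

    int binsPerSide = (gridSize + binSize - 1) / binSize;
    float myU = (iu - gridSize / 2) * pixelSize;
    float myV = (iv - gridSize / 2) * pixelSize;
    float2 acc = make_float2(0.0f, 0.0f);

    for (int du = -1; du <= 1; ++du) {
        for (int dv = -1; dv <= 1; ++dv) {
            int bu = iu / binSize + du;
            int bv = iv / binSize + dv;
            if (bu < 0 || bu >= binsPerSide || bv < 0 || bv >= binsPerSide) continue;
            int b = bv + bu * binsPerSide;
            for (int s = binFirst[b]; s < binLast[b]; ++s) {
                if (fabsf(srcU[s] - myU) <= kernelExtent && fabsf(srcV[s] - myV) <= kernelExtent) {
                    acc.x += srcVal[s].x;        // placeholder: apply the convolution weight here
                    acc.y += srcVal[s].y;
                }
            }
        }
    }
    grid[iv + iu * gridSize] = acc;
}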

I am not sure that this is exactly the same thing, but take a look at the gridding (convolution) described at the link below. It translates easily to CUDA:

Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware (IEEE Xplore): http://ieeexplore.ieee.org/xpl/freeabs_all...unt=18&index=11

/ Thomas