Hardware / Programming language limitations

I am designing an algorithm for solving a specific type of partial differential equation.
The algorithm is essentially sequential, but I managed to turn it into a parallel one in order to exploit the GPU. The hardware available to me is either a 7950GTX or an 8800GTX.
Without getting into the technical details, I would like to explain the algorithm briefly.
I have a 2D array with the current state of the solution to the PDE. In order to solve the equation, I need to perform the following task:
Process a single column of the array. All the elements/pixels in the column can be processed in parallel, because the pixels in a column depend only on pixels from the column to the left and on the current pixel being processed. Each pixel (i,j) depends on itself, the pixel to its left (i-1,j), the upper-left pixel (i-1,j+1) and the lower-left pixel (i-1,j-1).

After I finish scanning the grid column by column from left to right, I need to perform the same task 3 more times:
one scan of the columns from right to left, one scan of the rows from top to bottom and, finally, a scan of the rows from bottom to top.
So in total I have 4 sweeps of the plane in 4 directions.
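To make the dependency structure concrete, here is a serial C sketch of one left-to-right sweep. The actual PDE update rule is beside the point here, so a simple average stands in for it, and I clamp at the top/bottom borders; both of those are placeholders, not my real stencil:

```c
#define N 8  /* tiny grid for illustration; real grids are 100x100 .. 10000x10000 */

/* Placeholder update rule: a plain average of the four inputs stands in
   for the real PDE stencil, which is not spelled out here. */
static float update(float self, float left, float left_up, float left_down)
{
    return 0.25f * (self + left + left_up + left_down);
}

/* One left-to-right sweep. Column i depends only on the already-finished
   column i-1, so all rows j within a column could run in parallel.
   u[j][i] holds pixel (i,j): i = column, j = row; borders are clamped. */
static void sweep_left_to_right(float u[N][N])
{
    for (int i = 1; i < N; ++i) {        /* columns: strictly sequential */
        for (int j = 0; j < N; ++j) {    /* rows: parallelizable        */
            int up   = (j + 1 < N) ? j + 1 : j;
            int down = (j > 0)     ? j - 1 : j;
            u[j][i] = update(u[j][i], u[j][i-1], u[up][i-1], u[down][i-1]);
        }
    }
}
```

The other three sweeps are the same loop walked in the other three directions.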

The challenges I have with respect to GPU hardware limitations are:

  1. Rendering a single line is inefficient, since the raster hardware works in 2x2 blocks, so half of the processors might be idle. I am using Cg right now. Does CUDA have this limitation, or does it bypass the fixed rasterization stage? When using GL_LINE instead of a GL_QUAD (with a width of 1 pixel), does the rasterizer also use 2x2 blocks?
  2. Grid sizes can be anything from 100x100 to 10000x10000. Even though there are only a few dozen parallel processors, the GPU performs badly when it processes only a small number of pixels in parallel. So for large grids (4000x4000 and up), a column/row is more than 4000 pixels long and performance is better. For small grids (e.g. 200x200), the GPU has only 200 pixels to process in each iteration and performance is bad.
  3. I cannot find an arrangement of my data in the grid that avoids both of the problems above. For example, to sidestep the raster issue I could split each column into 2 smaller columns and put them next to each other. This way I can process a quad of (N/2)x2 instead of Nx1. The problem is that this arrangement is good for horizontal scans but not for vertical ones. I could copy and rearrange the data before each pass, but I have a feeling that this would hurt performance. Am I right?
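For clarity, this is the kind of column split I mean in point 3 — one possible mapping (top half of the column into one sub-column, bottom half into the other, n assumed even):

```c
/* Pack one n-element column into an (n/2) x 2 quad: the top half of the
   column becomes sub-column 0, the bottom half sub-column 1.
   This is just one possible mapping; n is assumed even. */
static void column_to_quad(const float *col, int n, float quad[][2])
{
    int half = n / 2;
    for (int k = 0; k < half; ++k) {
        quad[k][0] = col[k];         /* top half    -> left sub-column  */
        quad[k][1] = col[half + k];  /* bottom half -> right sub-column */
    }
}
```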

Any comments on overcoming the hardware limitations or on rearranging the data (might a 3D texture help?) would be appreciated.

Thank you!

If you use CUDA you don’t have to worry about rasterization - you control which thread processes what. I’d suggest reading through the CUDA Programming Guide to get an understanding of the programming model. It should resolve some of the difficulties you’re running into with traditional GPGPU.

You should see a speedup when switching from shaders to CUDA, since your approach can take advantage of shared memory (most elements of the column to the “left” are used three times).
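To see the reuse concretely: each left-column element is read as the "left", "left-up" or "left-down" input of up to three pixels in the new column. A quick C count over your stencil (assuming clamped borders, which is my assumption, not something you stated) shows exactly three reads per element:

```c
/* Count how many times each element of the "left" column is read while
   updating one column of n pixels, with the borders clamped. */
static void count_left_reads(int n, int counts[])
{
    for (int j = 0; j < n; ++j) counts[j] = 0;
    for (int j = 0; j < n; ++j) {
        int up   = (j + 1 < n) ? j + 1 : j;
        int down = (j > 0)     ? j - 1 : j;
        counts[j]++;     /* read as the "left" input of pixel j       */
        counts[up]++;    /* read as the "left-up" input of pixel j    */
        counts[down]++;  /* read as the "left-down" input of pixel j  */
    }
}
```

So loading the left column into shared memory once saves roughly two thirds of those reads.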

Paulius

Actually, I think you can perform each of your sweeps in a single pass using CUDA, since a CUDA thread can cache results in shared or local memory and write to global memory more than once.
For the memory layout, it’s possible to use a dynamic one. With some fiddling with shared memory, you can transpose your memory layout nearly for free at each right-to-left and bottom-to-top sweep (read, compute, write transposed). This way you get coalesced memory reads and writes (which usually means near-optimal performance) in all passes, including the transposing ones.
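In serial C form, the “write transposed” idea looks roughly like this (a plain average stands in for the real update rule and borders are clamped — both placeholders; in actual CUDA the transposed store would go through a shared-memory tile to keep the global writes coalesced):

```c
/* One sweep that writes its result transposed, so the next sweep (in the
   other direction) again walks contiguous memory. `in` and `out` are
   separate n*n row-major buffers; out[i*n + j] holds logical pixel (i,j). */
static void sweep_write_transposed(const float *in, float *out, int n)
{
    for (int j = 0; j < n; ++j)          /* column 0 has no left neighbour: */
        out[0 * n + j] = in[j * n + 0];  /* copy it through, transposed     */

    for (int i = 1; i < n; ++i) {        /* i: logical column being produced */
        for (int j = 0; j < n; ++j) {    /* j: row                          */
            int up   = (j + 1 < n) ? j + 1 : j;
            int down = (j > 0)     ? j - 1 : j;
            /* column i-1, already updated, now lives in row i-1 of out */
            float v = 0.25f * (in[j * n + i]
                               + out[(i - 1) * n + j]
                               + out[(i - 1) * n + up]
                               + out[(i - 1) * n + down]);
            out[i * n + j] = v;          /* transposed store */
        }
    }
}
```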

However, small grids can be an issue. CUDA needs several thousand threads to be effective (to hide latency, one needs many threads per processing unit). For a 200*200 grid, it likely won’t perform very well.
Another issue is that the 7950 doesn’t support CUDA. The 8800 is more expensive, but personally I strongly recommend it if you’re doing a lot of scalar arithmetic. One of our scalar programs saw a nearly 10x speedup after switching to the 8800.