I am designing an algorithm for solving a specific type of partial differential equation.
The algorithm is essential sequential but I managed to turn it into a parallel one for the sake of exploiting the GPU. Hardware available for me is either 7950GTX or 8800GTX.
Without getting into technical details of the algorithm, I would like to explain briefly the algorithm.
I have a 2D array with the current state of the solution to the PDE. In order to solve the equation, I need to perform the following task:
Process a single column of the array. All the elements/pixels in the column can be processed in parallel. The pixels in the column depend only on pixels from the column on the left and the current pixel which is being processed. Each pixel depend on it self (i,j), the pixel on the left (i-1,j), the left up (i-1,j+1) and left down (i-1, j-1).
After I complete to scan the grid column by column, left to right I need to perform the same task 3 more times:
One scan of columns from right to left. Another scan of rows from top to bottom and last, a scan of rows from bottom to top.
So completely I have 4 sweeps of the plane in 4 directions.
The challenges I have with respect to GPU hardware limitations are:
- Rendering a single line is inefficient since raster hardware work in 2x2 blocks. Half of the processors might be idle. I am using CG right now. Does CUDA has this limitation or does it by pass the fixed raster functionality? When using GL_LINE instead a GL_QUAD (with width of 1 pixel) does the rasterizer also use 2x2 blocks?
- Grid sizes can be anything from 100x100 to 10000, 10000. Even though, there are only a few dozen parallel processors, the GPU performs badly when processing only small number of pixels in parallel. So for large grids (4000x4000) and up, the size of a column/row is more than 4000 pixels and performances are better. For small grids (e.g. 200x200) The GPU has only 200 pixels to process in each iteration and performance are bad.
- I cannot find an optimal arrangement pattern for my data in the grid in order to avoid the above mentioned problems. For example, if I want to avoid raster issues, I could arrange the data by dividing each column to 2 smaller columns and put them next to each other. This way, I can process a QUAD of (N/2)x2 instead of Nx1. The problem is that this arrangement is good for horizontal scans but not good for vertical ones. I could copy the data and rearrange it each time but I have a sense that it will harm performance. Am I right?
Any comments about overcoming the limitations of the hardware or rearranging the data (Does 3D texture might help??) would be appreciated.