If this is the wrong forum, please let me know.

We have just started working on this problem (new hardware is on order [GTX 280 / AMD 9950 / Gigabyte GA-MA790GP-DS4H]), and since the timeframe is tight we need to start designing before the hardware is up and stable.

It is already implemented on a regular CPU, but even after a substantial amount of optimization it takes days and days to run on larger datasets. I believe (hope? pray?) we can gain substantially (50x-100x?) by using the new hardware and CUDA.

The problem goes like this: there are coordinate vectors in 11-space (roughly 100,000 to 2,000,000 of them, depending on the dataset). Call this 2D coordinate array “c” and the number of coordinate vectors in c “n”. There is an extremely large number of simulation training runs (hundreds of millions) on disk; call this 2D array “s”. Essentially, we take one coordinate vector from s and transform it against each coordinate vector in c (the transform is similar to a modified distance formula). Each resulting transform is then compared against a standard and handled appropriately.
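To make sure I'm describing the computation clearly, here is a minimal sketch of the CPU baseline. The actual "modified distance formula" and the "handled appropriately" step aren't shown above, so plain squared Euclidean distance and a simple threshold count stand in for them; all names here are hypothetical:

```c
#define DIM 11  /* vectors live in 11-space */

/* Stand-in for the "modified distance formula": plain squared
   Euclidean distance. The real transform differs. */
double transform(const double s_vec[DIM], const double c_vec[DIM])
{
    double acc = 0.0;
    for (int k = 0; k < DIM; ++k) {
        double d = s_vec[k] - c_vec[k];
        acc += d * d;
    }
    return acc;
}

/* CPU baseline: one s vector against every row of c (row-major, n x 11).
   "standard" is the comparison threshold; what happens to results that
   pass it is application-specific, so here we just count them. */
int process_one_s(const double s_vec[DIM],
                  const double *c, int n, double standard)
{
    int hits = 0;
    for (int i = 0; i < n; ++i)
        if (transform(s_vec, &c[i * DIM]) < standard)
            ++hits;
    return hits;
}
```

The full job is this inner loop repeated for every one of the hundreds of millions of s vectors, which is why the n-way loop is the natural thing to parallelize.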

Our initial CUDA approach is to store the complete array “c” in the 280’s memory. If a given “c” is too big for the 280’s memory, we will chunk it.

Now, run “n” threads so that each thread produces one answer (one transformed vector). Each thread receives, through kernel parameters, the 11 values of one “s” vector. The resulting “n” answers are copied back to the CPU, and this step is repeated with the next coordinate vector from “s”.
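In case it helps the discussion, this is roughly the kernel we have in mind. The 11 values travel by value in the parameter list as described; squared Euclidean distance again stands in for our real transform, and all names are our own, not anything from the CUDA toolkit:

```cuda
#define DIM 11

struct Vec11 { float v[DIM]; };  /* one s vector, small enough for the
                                    kernel parameter area */

/* One thread per row of c; out[i] holds the transform of s against c row i. */
__global__ void transform_kernel(const float *c,  /* n x 11, row-major */
                                 int n,
                                 Vec11 s_vec,     /* passed by value */
                                 float *out)      /* n results */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < DIM; ++k) {
        float d = s_vec.v[k] - c[i * DIM + k];
        acc += d * d;
    }
    out[i] = acc;  /* copied back and compared to the standard on the CPU */
}

/* Launch sketch: d_c and d_out allocated with cudaMalloc; c is copied
   to the device once and reused for every s vector.
   int threads = 256;
   transform_kernel<<<(n + threads - 1) / threads, threads>>>(d_c, n, s, d_out);
*/
```

One thing to watch on the 280: with c stored row-major, neighboring threads read addresses 44 bytes apart, which won't coalesce well on compute 1.3 hardware; storing c transposed (column-major, so thread i reads c[k * n + i]) should help.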

In the meantime, another thread on a different CPU core does the rest of the processing (which is not GPU-appropriate). We have thought about switching the roles of “s” and “c” but aren’t convinced it would help.

We know we will need to tune the block and grid configuration. We also thought about storing a chunk of “s” on the GPU, but didn’t see a real advantage. Any comments on the overall approach? TIA.
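For what it's worth, the advantage we can imagine for staging a chunk of s is amortization: one launch then covers m s-vectors instead of one, cutting launch and copy-back overhead by a factor of m, and each thread can hold its c row in registers while walking the whole chunk. A hedged sketch of that variant, with the same stand-in distance and hypothetical names as above:

```cuda
#define DIM 11

/* Each thread still owns one row of c, but now loops over a chunk of m
   s-vectors staged on the device, producing an m x n block of results
   per launch instead of n. */
__global__ void transform_chunk_kernel(const float *c, int n,       /* n x 11 */
                                       const float *s_chunk, int m, /* m x 11 */
                                       float *out)                  /* m x n */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float c_row[DIM];                 /* load this thread's c row once */
    for (int k = 0; k < DIM; ++k)
        c_row[k] = c[i * DIM + k];

    for (int j = 0; j < m; ++j) {     /* reuse it against every s in the chunk */
        float acc = 0.0f;
        for (int k = 0; k < DIM; ++k) {
            float d = s_chunk[j * DIM + k] - c_row[k];
            acc += d * d;
        }
        out[j * n + i] = acc;
    }
}
```

Whether this wins in practice depends on how fast the CPU side can consume the m*n results per launch, so it may be something to benchmark rather than assume.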