I would if I had an 8800 card (and if I could get 64-bit drivers).

Let’s say we use a Core 2 Extreme QX6700.

I would be happy with some typical or best case numbers, or anything regarding issue overhead, really.

I am looking to offload a computation that needs to run about 100,000 to 200,000 times per second on hardware like the above. Right now it’s running on the host CPU, with loads of branches that exit early wherever we think we can do so without hurting accuracy too much.

Obviously, we’d rather brute force the calculation (which boils down to some vector-matrix multiplications). It should parallelize almost perfectly, so it’s just a matter of throwing enough FLOPS at the problem.

Here’s an example workload. The size of the matrix math can be changed; we’ll just lose (or gain) accuracy. But the need to run 100k to 200k times per second remains.

input: 1 x 1024 floats

processing: f((1 x 1024) x (1024 x 1024)) x (1024 x 1) = (1 x 1)

(the 1024 x 1024 and 1024 x 1 matrices are constant during the runs so they only need to be uploaded once)

(f consists of a few multiplies and additions, you can ignore it)

output: single float
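
To make the data flow concrete, here is a minimal CPU reference sketch of one run. The names `evaluate`, `v`, `M` and `w` are mine, the body of `f` is a made-up placeholder (the real f is just a few multiplies and additions), and I am assuming f is applied element-wise to the 1 x n intermediate vector.

```python
def f(x):
    # Hypothetical stand-in for the real f ("a few multiplies and additions").
    return 0.5 * x + 1.0


def evaluate(v, M, w):
    """One run: f(v x M) x w, reducing n input floats to a single float.

    v: input vector, length n (changes every run)
    M: constant n x n matrix, as a list of rows (uploaded once)
    w: constant n x 1 vector, length n (uploaded once)
    """
    n = len(v)
    # (1 x n) x (n x n) -> (1 x n)
    t = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
    # Assumption: f is applied to each element independently.
    u = [f(x) for x in t]
    # (1 x n) x (n x 1) -> (1 x 1)
    return sum(u[j] * w[j] for j in range(n))


if __name__ == "__main__":
    # Tiny 2 x 2 example; the real workload uses n = 1024.
    print(evaluate([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]))
```

Every run is independent of the others, which is why I expect it to parallelize so well.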

How fast can an 8800 do this? What about 512, 256, 128, … element matrices? And do non-power-of-2 sizes work efficiently?
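
For scale, here is a rough back-of-the-envelope sketch of the arithmetic throughput the brute-force version would need. It counts one multiply and one add per matrix element plus the final dot product, and ignores f, memory traffic and issue overhead, so treat it as a lower bound on the work rather than a prediction. The function names are just for illustration.

```python
def flops_per_run(n):
    # Vector-matrix product: n*n multiplies + n*n adds.
    # Final dot product with the n x 1 vector: n multiplies + n adds.
    # f is ignored here (only a few multiplies and additions per element).
    return 2 * n * n + 2 * n


def required_gflops(n, runs_per_second):
    # Sustained throughput needed, in GFLOP/s.
    return flops_per_run(n) * runs_per_second / 1e9


for n in (1024, 512, 256, 128):
    low = required_gflops(n, 100_000)
    high = required_gflops(n, 200_000)
    print(f"n = {n:4d}: {low:6.1f} to {high:6.1f} GFLOP/s")
```

For n = 1024 this works out to roughly 210 to 420 GFLOP/s sustained, and each halving of n cuts the requirement by about a factor of four.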