I am curious what the issue overhead is of doing a calculation via CUDA.
For example, in the extreme case of sending 1 float to
the card, doing 1 math operation, and reading 1 float
back, what is the maximum number of times we can run
this cycle per second?
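To make the cycle concrete, here is roughly what I mean by one iteration. This is just a sketch; the kernel and the timing loop are placeholders rather than a proper benchmark:

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// one math operation on the single float we sent up
__global__ void oneOp(float *x) { *x *= 2.0f; }

int main() {
    float h = 1.0f, *d;
    cudaMalloc(&d, sizeof(float));

    const int cycles = 10000;
    clock_t t0 = clock();
    for (int i = 0; i < cycles; ++i) {
        cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);  // 1 float up
        oneOp<<<1, 1>>>(d);                                        // 1 operation
        cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);  // 1 float back (blocks until the kernel is done)
    }
    double secs = double(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.0f round trips per second\n", cycles / secs);

    cudaFree(d);
    return 0;
}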
Does this vary between CUDA and pure OpenGL/DirectX?
Is it expected to improve with better drivers or are
there hard limits caused by hardware/protocols?
Is there variation on the hardware?
Derived from this question: for what size matrix
multiplications does CUDA start bringing an advantage
over the CPU?
I would if I had an 8800 card (and if I could get 64-bit drivers).
Let’s say we use a Core 2 Extreme QX6700.
I would be happy with some typical or best case numbers, or anything regarding issue overhead, really.
I am looking to offload a computation that needs to run about 100,000 to 200,000 times per second on hardware like the above. Right now it runs on the host CPU, with lots of branches that take early exits when we can probably do so without hurting accuracy too much.
Obviously, we’d rather brute force the calculation (which boils down to some vector-matrix multiplications). It should parallelize almost perfectly, so it’s just a matter of throwing enough FLOPS at the problem.
Here’s an example workload. The size of the matrix math can be changed; we’ll just lose (or gain) accuracy. But the need to run 100k to 200k times per second remains.
input: 1 x 1024 floats
processing: f((1 x 1024) x (1024 x 1024)) x (1024 x 1) = (1 x 1)
(the 1024 x 1024 and 1024 x 1 matrices are constant during the runs so they only need to be uploaded once)
(f consists of a few multiplies and additions, you can ignore it)
output: single float
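In code terms, one run is equivalent to something like the following reference version (assuming f is applied elementwise to the intermediate 1 x N vector; f is left as a placeholder since it’s just a few multiplies and adds):

#define N 1024  // or 512, 256, 128, ...

float f(float x) { return x; }  // placeholder for the few multiplies/adds

// one run: (1 x N) times (N x N), f applied, times (N x 1) -> one float
float run_once(const float in[N], const float M[N][N], const float w[N])
{
    float out = 0.0f;
    for (int j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += in[i] * M[i][j];  // column j of the (1 x N) x (N x N) product
        out += f(acc) * w[j];        // dot with the constant (N x 1) vector
    }
    return out;
}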
How fast can an 8800 do this? What if it’s 512, 256, 128, … element matrices? What about non-power-of-2 sizes, do those still work efficiently?
If you are doing lots of small calls to the GPU then you aren’t using it efficiently. It is designed to process data in large, highly parallel chunks. Apps that perform well on the GPU will not be making anywhere near 15,000 calls per second.
It is also not designed to have small amounts of data transferred to and from the device for every call. Typically you load large arrays onto the device, run many kernels on that data, keeping intermediate results on the device, and finally read the results back to the host.
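For the workload in the question, that pattern would look roughly like this: keep the constant matrix and vector on the device, upload the input vectors in large batches, and launch one kernel per batch instead of one call per run. The one-thread-per-run mapping below is only for clarity; a tuned matrix-vector kernel (or CUBLAS) would be organized quite differently:

#define N 1024

// Illustrative only: one thread handles one input vector end to end.
__global__ void runBatch(const float *inputs,   // batch x N input vectors
                         const float *M,        // N x N constant matrix (uploaded once)
                         const float *w,        // N x 1 constant vector (uploaded once)
                         float *outputs,        // one float per run
                         int batch)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= batch) return;

    const float *in = inputs + r * N;
    float out = 0.0f;
    for (int j = 0; j < N; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += in[i] * M[i * N + j];
        out += acc * w[j];   // f() omitted here
    }
    outputs[r] = out;
}

// Host side: cudaMemcpy M and w once, then per batch of (say) a few thousand
// inputs: copy the inputs up, launch runBatch<<<(batch + 255) / 256, 256>>>(...),
// and copy the batch of output floats back in one transfer.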