Issue overhead for different APIs and GPUs

Hi all,

I am curious as to what the issue overhead is of doing
a calculation via CUDA.

For example, in the extreme case of sending 1 float to
the card, doing 1 math operation, and reading 1 float
back, what is the maximum number of times we can run
this cycle per second?

Does this vary between CUDA and pure OpenGL/DirectX?

Is it expected to improve with better drivers or are
there hard limits caused by hardware/protocols?

Is there variation on the hardware?

Derived from this question: for what size matrix
multiplications does CUDA start bringing an advantage
over the CPU?

It depends on a lot of factors, including your CPU speed. Have you measured it?

Yes, they are different drivers.

Yes, we hope to improve all aspects of performance over time.

Yes, mostly due to CPU performance. We find that Intel Conroe (Core 2 Duo) CPUs perform best currently.

I’m not sure, I haven’t measured that.


I would if I had an 8800 card (and if I could get 64-bit drivers :)).

Let’s say we use a Core 2 Extreme QX6700.

I would be happy with some typical or best case numbers, or anything regarding issue overhead, really.

I am looking to offload a computation that needs to run about 100 000 - 200 000 times per second on hardware like the above. Right now it’s running on the host CPU with loads of branches to make early exits when we can probably do so without affecting accuracy too much.

Obviously, we’d rather brute force the calculation (which boils down to some vector-matrix multiplications). It should parallelize almost perfectly, so it’s just a matter of throwing enough FLOPS at the problem.

Here’s an example workload. The size of the matrix math can be changed; we’ll just lose (or gain) accuracy. But the need to run 100k to 200k times per second remains.

input: 1 x 1024 floats

processing: f((1 x 1024) x (1024 x 1024)) x (1024 x 1) = (1 x 1)

(the 1024 x 1024 and 1024 x 1 matrices are constant during the runs so they only need to be uploaded once)

(f consists of a few multiplies and additions, you can ignore it)

output: single float
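To put a number on how much raw compute the workload above demands, here is a back-of-envelope sketch (my arithmetic, not a measurement): the matrix-vector product dominates at roughly 2*N*N flops per run, counting a multiply-add as two flops.

```python
# Rough FLOP requirement for the stated workload: a (1 x 1024) by
# (1024 x 1024) matrix-vector product, then a length-1024 dot product.
# f() is ignored, as the original post suggests.

N = 1024
flops_per_run = 2 * N * N + 2 * N   # mat-vec plus final dot product
runs_per_sec = 200_000              # upper end of the stated rate

required_gflops = flops_per_run * runs_per_sec / 1e9
print(round(required_gflops, 1))    # roughly 420 GFLOPS at 200k runs/sec
```

That is in the same ballpark as an 8800 GTX’s theoretical peak, so at N = 1024 the compute itself is already borderline, before any per-call overhead is counted.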

How fast can an 8800 do this? What if it’s 512, 256, 128, … element matrices? What if they’re non-power-of-2 sizes, does that still work efficiently?

The following results are for Windows; the host machine is an Intel Core 2 Duo at 2.4 GHz and the device is an 8800 GTX.

Note that these are all “order of magnitude” results.

A function call to the device that immediately returns takes about 30 us.

Copying one byte to the device is about 15 us. About the same for copying back.

Transfer speeds saturate after about 64KB.
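These figures suggest a simple model: transfer time is a fixed setup cost plus size divided by bandwidth. A sketch below, where the ~3 GB/s PCIe bandwidth figure is my assumption and not from the measurements above:

```python
# Simple transfer-time model: fixed per-copy overhead plus
# size / bandwidth. Overhead is from the post; bandwidth is assumed.

OVERHEAD_S = 15e-6   # ~15 us fixed cost per copy (measured above)
BANDWIDTH = 3e9      # bytes/s, assumed PCIe throughput

def transfer_time(nbytes):
    return OVERHEAD_S + nbytes / BANDWIDTH

def effective_bandwidth(nbytes):
    return nbytes / transfer_time(nbytes)

for size in (1, 1024, 64 * 1024, 1024 * 1024):
    print(size, round(effective_bandwidth(size) / 1e6, 1), "MB/s")
```

Under this model, tiny copies are overhead-dominated and the effective bandwidth only approaches the link rate once transfers reach tens of kilobytes, which matches the "saturates after about 64KB" observation.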

Thanks a lot Greg.

That’s about 70us overhead. Or no more than 15 000 calls per second. Bummer :(

If you are doing lots of small calls to the GPU then you aren’t using it efficiently. It is designed to process data in large chunks of parallelism. Apps that perform well on the GPU will not be making anywhere near 15000 calls per second.

It is also not designed to have small amounts of data transferred to and from the device on every call. Typically you load large arrays onto the device, run many kernels on that data, keeping intermediate results on the device, and finally read the results back to the host.
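A quick sketch of why batching amortizes the fixed cost, using the order-of-magnitude figures from this thread (the per-item compute time is purely illustrative):

```python
# Amortized throughput with a fixed per-call cost: batching many work
# items into one call spreads the launch + copy overhead across them.

CALL_OVERHEAD_S = 60e-6   # launch plus two small copies, from the thread
ITEM_COMPUTE_S = 1e-6     # assumed per-item compute time, illustrative

def items_per_second(batch_size):
    total = CALL_OVERHEAD_S + batch_size * ITEM_COMPUTE_S
    return batch_size / total

print(round(items_per_second(1)))     # overhead-dominated, ~16k items/s
print(round(items_per_second(1000)))  # approaching the compute-bound rate
```

With one item per call you get roughly the ~15k calls/sec ceiling discussed above; with a thousand items per call the fixed overhead nearly disappears.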

FWIW, I typically see 20-25us overhead per call.


Can someone clarify: what is the overhead (in terms of registers, clocks, etc.) to calling a device function within the kernel?

Are such function calls inlined, or is there some kind of call stack?

There is no call stack. All device functions are inlined.