Hey everyone,
I’m back to analyzing what part of our algorithm to optimize next - and the biggest performance hit right now appears to be OpenGL…
We have two 320x240 images in OpenGL buffers, one 24bpp and one 16bpp (so 225kB and 150kB respectively), which our CUDA kernels need to access once per frame.
At present, locking these images still takes 150-350us each, and unlocking takes about the same - for an accumulated 300-700us just to lock/unlock the buffers each time. Doing this twice per frame sometimes approaches 1.5ms, which is ludicrous - though it tends to average 600-700us per frame…
From what I understand, as of CUDA 2.1 transfers between OpenGL and CUDA memory weren’t supposed to go via system memory (unless transferring between GPUs). However, the performance I’m seeing is less than half my PCI Express bandwidth - and barely a few percent of my device bandwidth - so I’m having trouble understanding what the hell is going on…
Each buffer is created only once (at start of application), registered/mapped just before use, and unmapped/unregistered just after use.
Mapping/unmapping generally takes 20-50us, registering generally takes 150-300us, and unregistering generally 100-250us… excluding spikes (which can sometimes reach 800us just to register, plus another 400us to unregister, etc).
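For reference, the per-frame pattern looks roughly like this - a sketch against the CUDA 2.x cudaGL* interop API; the buffer names and the run_kernels call are placeholders for our actual code:

```cpp
#include <cuda_gl_interop.h>

// pbo24 / pbo16 are GL pixel buffer objects created once at startup
// with glGenBuffers/glBufferData; run_kernels stands in for our
// per-frame CUDA work.
void process_frame(GLuint pbo24, GLuint pbo16)
{
    // "lock": register + map (the 150-350us part is the register calls)
    cudaGLRegisterBufferObject(pbo24);
    cudaGLRegisterBufferObject(pbo16);

    void *dev24 = 0, *dev16 = 0;
    cudaGLMapBufferObject(&dev24, pbo24);   // mapping itself: 20-50us
    cudaGLMapBufferObject(&dev16, pbo16);

    run_kernels(dev24, dev16);              // actual per-frame work

    // "unlock": unmap + unregister (another 100-250us or more)
    cudaGLUnmapBufferObject(pbo24);
    cudaGLUnmapBufferObject(pbo16);
    cudaGLUnregisterBufferObject(pbo24);
    cudaGLUnregisterBufferObject(pbo16);
}
```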
I get more or less similar performance on 8600, 8800, and Quadro FX 570 - all using CUDA 2.1, all on Windows XP 32bit…
Is this ‘performance’ due to the fact that OpenGL interop still runs over the PCIe bus (despite what I recall reading?), or is it that my data is too small to transfer efficiently - so I’m actually seeing the latency of the memory copy, and thus not achieving full bandwidth?
I’m guessing it’s the former, because I’m assuming the latter would reach far more than ~1.3% of peak bandwidth (which is what I’m getting, ~600MB/s out of ~45GB/s)… right?
Or even worse, is this simply the expected performance for opengl interop? :blink: