CUDA -> OpenGL bandwidth

I modified the postProcessGL example from the SDK to measure bandwidth from CUDA to OpenGL. It works by generating an image in a CUDA kernel and then transferring it to the OpenGL context through a PBO. Nothing is transferred from the CPU to CUDA or from OpenGL to CUDA. The modified source can be downloaded from here.

It seems that I get really bad results: only 390 MB/s. Am I doing something wrong, or is this really that slow?

The bandwidth is about equal to what I get by transferring data from CUDA to the CPU and then to OpenGL… So is this what the CUDA drivers currently do?

What CUDA version are you using? CUDA 2.0 is supposed to improve this; you might try the beta if you haven’t already.

I’m already running the 2.0 beta… I’ve been waiting for CUDA 2 for exactly this reason :(

(Bandwidth was similar with CUDA 1.1.)

Which GPU are you using and how much memory does it have?

GeForce 8800 GTS with 640 MB.

I’m mostly running quad-head with two such cards or two 9600 GTs, but this has also been tested on a machine with only one 8800 GTS. The workstation is a Sun Ultra 40 M2.

I’m running Linux (CentOS 5.1).

Has anyone tried that program? What kind of results are you getting?

I tried the program with the new CUDA 2.0 on my Quadro FX6500 and the result is 465 MB/s.

That is an improvement over 390 MB/s, but not good enough for us to use OpenGL functions as part of a computing process.

I still cannot understand why the speed is so far from the optimal 70 GB/s device-to-device memory bandwidth, and even lower than host-to-device transfers with pageable memory (1278 MB/s).