I modified the postProcessGL example from the SDK to measure bandwidth from CUDA to OpenGL. It works by generating an image in a CUDA kernel and then transferring it to the OpenGL context through a PBO. Nothing is transferred from CPU to CUDA or from OpenGL to CUDA. The modified source can be loaded from here.
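For readers who haven't seen the interop path being described, a minimal sketch follows, using the CUDA 2.0-era GL interop calls (`cudaGLRegisterBufferObject`, `cudaGLMapBufferObject`). The `pbo`, `width`, `height` parameters and the `fillImage` kernel are assumed placeholders, not names from the posted code:

```cuda
// Sketch of the CUDA -> PBO -> OpenGL path discussed in this thread.
// Assumes a GL context, GLEW/extension loading, a bound GL_TEXTURE_2D,
// and a PBO of width*height*4 bytes already created by the caller.
#include <cuda_gl_interop.h>

// Hypothetical kernel that generates the image on the device.
extern __global__ void fillImage(uchar4 *dst, int w, int h);

void renderFrame(GLuint pbo, int width, int height)
{
    // In real code this registration is done once at setup, not per frame.
    cudaGLRegisterBufferObject(pbo);

    // Map the PBO into the CUDA address space and let the kernel write it.
    uchar4 *devPtr = 0;
    cudaGLMapBufferObject((void **)&devPtr, pbo);
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    fillImage<<<grid, block>>>(devPtr, width, height);
    cudaGLUnmapBufferObject(pbo);

    // GL side: source the texture upload from the PBO, so in theory
    // no CPU copy is involved.
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}
```

The map/unmap pair is the part the bandwidth measurement stresses; everything between them stays on the device.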
I seem to get really bad results, only 390 MB/s. Am I doing something wrong, or is it really that slow?
The bandwidth is about equal to what I get by transferring data from CUDA to the CPU and then to OpenGL… so is that what the CUDA drivers currently do?
I’m mostly running quad-head with two such cards, or two 9600 GTs, but this has also been tested on a machine with only one 8800 GTS. The workstation is a Sun Ultra 40M2.
I tried the program with the new CUDA 2.0 on my Quadro FX6500 and the result is 465 MB/s.
That is an improvement over 390 MB/s, but still not good enough for us to exploit OpenGL functions as part of the computing process.
I still cannot understand why the speed is so far from the roughly 70 GB/s device-to-device memory bandwidth, and even lower than host-to-device transfers, even with pageable memory (1278 MB/s).