serious performance degradation when acquiring opengl-textures


I am working on the following case: there are 3 passes in every frame update of my program:
pass #1: opengl outputs some color, normal and position values to off-screen textures
pass #2: opencl acquires ownership of these textures and processes the values in them to apply some lighting effect,
and the resulting pixel is written to an output texture.
pass #3: the output-texture is drawn to the screen onto a full-screen quad to show the end-results.

now the problem is that the performance drops considerably when the view-port, and hence all the off-screen textures, are enlarged (the size of the off-screen textures and the size of the view-port are the same).

First I thought this might be due to the increased number of pixels that need to be processed (longer execution time).
But after some experiments I realized that the problem was related to the size of the bound textures!

To test this I wrote a very trivial kernel that writes a static color value directly to the output-texture,
so there is no long per-pixel processing:
the kernel-execution time does not depend on the input and is only limited by the write-to-texture speed!
then I varied the number of textures acquired by opencl at full resolution (1680x1050 pixels)

I added the following textures in the given order and measured the total kernel preparation and execution time (acquiring the opengl objects, setting the arguments, launching the kernel, releasing the opengl objects and calling clFinish, as recommended in the nvidia opencl guides):

  1. output texture (RGBA8)
  2. color-map (RGBA8)
  3. position-map (RGBA32F)
  4. normal-map (RGBA32F)

I measured the following total times with the trivial kernel:
bound opengl objects:1 -> 8 ms
bound opengl objects:1,2 -> 15.7 ms
bound opengl objects:1,2,3 -> 45.9 ms
bound opengl objects:1,2,3,4 -> 75.7 ms

I think the problem is obvious.
the first two textures are of the same size, they are both RGBA8
adding one of the rgba8-textures results in about 7-8 ms of overhead…
the last two textures are RGBA32F, adding one of them causes about 30 ms of overhead…
the latency seems to be proportional to the memory consumed by the texture.
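
to put rough numbers on that, here is a small sketch that turns the timings above into an implied transfer rate per added texture. It assumes tightly packed textures (4 bytes per pixel for RGBA8, 16 for RGBA32F) and treats the difference between two consecutive measurements as the cost of the newly added texture; the function names are mine, only for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Bytes occupied by one width x height texture, assuming tightly packed pixels.
double texture_bytes(int width, int height, int bytes_per_pixel) {
    return static_cast<double>(width) * height * bytes_per_pixel;
}

// Implied rate in GB/s (1 GB = 1e9 bytes) if `bytes` were moved in `ms` milliseconds.
double implied_gb_per_s(double bytes, double ms) {
    assert(ms > 0.0);
    return bytes / (ms * 1e-3) / 1e9;
}

// Prints size, incremental overhead and implied rate for each added texture.
void print_report() {
    const int w = 1680, h = 1050;
    const double rgba8 = texture_bytes(w, h, 4);     // 4 channels x 8 bit
    const double rgba32f = texture_bytes(w, h, 16);  // 4 channels x 32-bit float

    // Incremental overheads derived from the measured totals (8, 15.7, 45.9, 75.7 ms).
    const double ms_color = 15.7 - 8.0;      // adding the color-map (RGBA8)
    const double ms_position = 45.9 - 15.7;  // adding the position-map (RGBA32F)
    const double ms_normal = 75.7 - 45.9;    // adding the normal-map (RGBA32F)

    std::printf("color-map:    %.1f MB, +%.1f ms, %.2f GB/s\n",
                rgba8 / 1e6, ms_color, implied_gb_per_s(rgba8, ms_color));
    std::printf("position-map: %.1f MB, +%.1f ms, %.2f GB/s\n",
                rgba32f / 1e6, ms_position, implied_gb_per_s(rgba32f, ms_position));
    std::printf("normal-map:   %.1f MB, +%.1f ms, %.2f GB/s\n",
                rgba32f / 1e6, ms_normal, implied_gb_per_s(rgba32f, ms_normal));
}
```

calling print_report() gives an implied rate of roughly 0.9 GB/s for all three textures, i.e. the overhead per byte is about the same for RGBA8 and RGBA32F. That is what you would expect if each acquired texture is being copied somewhere, although the numbers alone do not say where.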

I am also fairly sure that I am using the opengl interoperability extension correctly:
I checked the extension-string and I performed memory read tests: they all passed!
I also checked that the opencl-images are created from the opengl-textures correctly;
they are fine too (I am getting the desired color in my output-texture)
so I do not think I am doing something wrong in setting up the opencl-opengl interface.

I expected this to work as fast as the shader version of this kernel function but it does not…
it seems that some internal memory operations are done under the hood.
what could be the cause of the overhead when acquiring the opengl-textures by opencl?
can we overcome it?

driver 258.96
opencl v1.0
msvc 2003 .net

I guess I found the reason:
my gtx-295 works in dual-gpu mode and the two gpus are named A and B in the nvidia control panel.
I guess what happens is as follows:
opengl runs on one gpu (say A) while opencl runs on the other (say B).
so if I want to share resources with opencl, then these resources have to be copied from gpu A to gpu B.

I disabled the dual-gpu mode and the latency is suddenly gone!
I think this supports the above explanation (and maybe this is also why OptiX requires single-gpu mode on the GTX-295)

so important lesson:
if you have a dual-gpu card (like the gtx-295), take this into consideration!