Bad performance of GL/CL interop on Windows

I have an application that runs mixed fragment shader and OpenCL processing on OpenGL textures, making heavy use of CL/GL interop.
The application runs fast on AMD cards, on both Linux and Windows. On NVIDIA cards, it runs fast on Linux but very slowly on Windows 7. The problem seems to lie in enqueueAcquireGLObjects and enqueueReleaseGLObjects. I have created a minimal sample that demonstrates the bad performance by simply:

  1. Creating 2 OpenGL textures (1600x1200 pixels, RGBA float)
  2. Creating 2 OpenCL images, sharing the 2 textures
  3. Repeatedly (50 times) acquiring and releasing the OpenCL images

Results (mean time for executing one acquire, release, finish cycle):

  • AMD HD 6980, Linux: <0.1 ms
  • AMD HD 6980, Win7: <0.1 ms
  • NVIDIA GTX590, Linux: <0.1 ms
  • NVIDIA GTX590, Win7: 16.0 ms

I have tried several different NVIDIA drivers, from the older 295.73 to the current beta driver 326.80, all showing the same behaviour. I also tried different GeForce cards, ranging from the 480 to the 680. The problem cannot be caused by poor OpenCL support on NVIDIA cards in general, because the same code runs fine on Linux. It also cannot be caused by bad Win7-specific code on my side, since it runs fine on AMD under Win7.

Below are the relevant bits of code; I can provide the full code if needed. It is a toy example, of course, and only meant to demonstrate the problem with as little code as possible…

Here is the relevant code for texture creation:

width_ = 1600;
height_ = 1200;

float *data = new float[ width_*height_*4 ];  // 4 floats (RGBA) per pixel

textures_.resize(2);
glGenTextures(2, textures_.data());

for (int i=0;i<2;i++) {
  glBindTexture(GL_TEXTURE_2D, textures_[i]);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
  // "data" pointer holds random/uninitialized data, do not care in this example
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width_,height_, 0, GL_RGBA, GL_FLOAT, data);
}

delete[] data;  // allocated with new[], so it must be freed with delete[]

{ // create shared CL Images
#ifdef CL_VERSION_1_2
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#else
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
  clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#endif
}

And here is the relevant code for the acquire/release cycle:

try {
  queue_->enqueueAcquireGLObjects( &clImages_ );
  queue_->enqueueReleaseGLObjects( &clImages_ );
  queue_->finish();
} catch (cl::Error &e) {
  std::cout << e.what() << std::endl;
}
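
A mean time like the one reported in the results could be measured with a small timing helper along the following lines. This is only a sketch under assumptions: `meanMillis` is a hypothetical name that is not part of the sample, it uses plain wall-clock timing from `std::chrono`, and the callable passed in stands for one acquire/release/finish cycle. Wall-clock timing is only meaningful here because each cycle ends with `finish()`, which blocks until the queued commands have completed.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of the original sample): runs `cycle`
// `iterations` times and returns the mean wall-clock time per call in ms.
template <typename Fn>
double meanMillis(int iterations, Fn&& cycle) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited for intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        cycle();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}

// Possible usage with the members from the sample above (not compiled here):
//   double ms = meanMillis(50, [&] {
//       queue_->enqueueAcquireGLObjects(&clImages_);
//       queue_->enqueueReleaseGLObjects(&clImages_);
//       queue_->finish();
//   });
```

Timing the whole loop and dividing by the iteration count keeps the timer overhead out of each individual cycle, which matters when a single cycle takes well under 0.1 ms.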

I posted the same question on Stack Overflow a few days ago, but it seems no one there can help me…
http://stackoverflow.com/questions/18492425/opencl-gl-interop-slow-on-nvidia-win-but-fast-on-linux