Performance of Acquire/ReleaseGLObjects

Hello everyone, new guy here.

I’m having serious performance issues in my code when acquiring/releasing resources shared with OpenGL.

I have a kernel that does absolutely nothing:

[codebox]kernel void foo()
{
}[/codebox]

Inside my display function I just do:

[codebox]void display(void)
{
    static cl::KernelFunctor foo = foo_kernel.bind(queue, cl::NDRange(1, 1), cl::NDRange(1, 1));

    queue.enqueueAcquireGLObjects(&shared_objects);
    foo();
    queue.enqueueReleaseGLObjects(&shared_objects);

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glutSolidTeapot(1.0f);
    glutSwapBuffers();
}[/codebox]
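To isolate the cost of the two calls themselves, they can be timed with OpenCL event profiling. Here is a minimal sketch (not my actual code) that assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and reuses the names from the snippet above:

[codebox]// Sketch: requires the queue to be built with profiling enabled, e.g.
// cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);

cl::Event acquire_ev, release_ev;

queue.enqueueAcquireGLObjects(&shared_objects, NULL, &acquire_ev);
foo();
queue.enqueueReleaseGLObjects(&shared_objects, NULL, &release_ev);
queue.finish();

// Profiling timestamps are reported in nanoseconds
cl_ulong acq_ns = acquire_ev.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                  acquire_ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong rel_ns = release_ev.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                  release_ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();

printf("acquire: %.3f ms, release: %.3f ms\n", acq_ns * 1e-6, rel_ns * 1e-6);[/codebox]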

The shared_objects vector contains a single cl::Image2DGL, created as follows:

[codebox]glEnable(GL_TEXTURE_2D);
glGenTextures(1, &shared_tex);
glBindTexture(GL_TEXTURE_2D, shared_tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, TEXTURE_WIDTH,
             TEXTURE_HEIGHT, 0, GL_RGBA, GL_FLOAT, NULL);

shared = cl::Image2DGL(context, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, shared_tex);
shared_objects.push_back(shared);

glBindTexture(GL_TEXTURE_2D, 0);[/codebox]
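For completeness, cl::Image2DGL only works if the cl::Context was created with the GL-sharing properties. On Windows that part looks roughly like this (a sketch, assuming the GL context is already current and that platform is the cl::Platform selected earlier):

[codebox]// Sketch: the GL context must be current on this thread.
cl_context_properties props[] = {
    CL_GL_CONTEXT_KHR,   (cl_context_properties) wglGetCurrentContext(),
    CL_WGL_HDC_KHR,      (cl_context_properties) wglGetCurrentDC(),
    CL_CONTEXT_PLATFORM, (cl_context_properties) platform(),
    0
};

context = cl::Context(CL_DEVICE_TYPE_GPU, props);[/codebox]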

Drawing the teapot alone (without the acquire/release around the kernel execution) takes around 3.61 ms.

However, with the acquire/release in place, frame times seem to scale directly with the size of the texture, even though the texture is never actually used (neither in OpenCL nor in OpenGL):

Texture size    Time (ms)
256             ~4.36
1K              ~4.67
4K              ~4.9
16K             ~4.5
64K             ~4.4
256K            ~4.9
1M              ~6.1
4M              ~15.2
16M             ~57.3

System info: NVIDIA GTX 465 1 GB (driver 258.96), Intel Core 2 Duo @ 2.66 GHz, 2 GB RAM.

As you can see, beyond 1M the texture size really slows the application down. What is the driver doing? Is it copying the data between memory regions?
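One thing I can think of is implicit synchronization: cl_khr_gl_sharing expects pending GL work on the shared objects to be finished before the acquire, and pending CL work to be finished before GL touches them again. That can be made explicit to rule it out, something like this (a sketch, same names as in display above):

[codebox]glFinish();  // let any pending GL work on the shared texture complete

queue.enqueueAcquireGLObjects(&shared_objects);
foo();
queue.enqueueReleaseGLObjects(&shared_objects);

queue.finish();  // make sure OpenCL is done before OpenGL uses the texture again[/codebox]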

Is it normal behavior or am I doing something wrong?

Thanks

I’ve tested the same code on several machines with different OSs, and the issue seems to be specific to GeForce 400-series GPUs on Windows.

Below, YES marks the systems where texture size significantly influences acquire/release time (up to 50 ms), and NO those where it is negligible (no more than 1 ms):

Windows Vista 32-bit, NVIDIA GTX 465, drivers 258.96 - YES
Windows Vista 32-bit, NVIDIA GTX 465, drivers 260.99 - YES
Windows 7 32-bit, NVIDIA GTX 465, drivers 260.99 - YES
Ubuntu 10.10 32-bit, NVIDIA GTX 465, drivers 260.19.06 - NO
Windows 7 32-bit, NVIDIA GTX 480, drivers 258.96 - YES
Windows 7 64-bit, NVIDIA GTX 260, drivers 260.99 - NO

Judging by these results, I’d say there’s a bug in the Windows drivers for the 400-series cards…
