OpenGL interop is quite slow

Originally, I created OptiX buffers, mapped them, copied all vertex, normal, texCoord and index data from the host, and then used these buffers to create an OptiX geometry:

mOptix.mIndices   = ctx->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_INT3, this->mNumTriangles);
mOptix.mVertices  = ctx->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, this->mNumVertices);
mOptix.mNormals   = ctx->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, this->mNumVertices);
mOptix.mTexCoords = ctx->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT2, this->mNumVertices);

// map buffers
int32_t* indices = reinterpret_cast<int32_t*>(mOptix.mIndices->map());
float* vertices  = reinterpret_cast<float*>(mOptix.mVertices->map());
float* normals   = reinterpret_cast<float*>(mOptix.mNormals->map());
float* texcoords = reinterpret_cast<float*>(mOptix.mTexCoords->map());

// fill indices
for (size_t i = 0; i < this->mNumTriangles; i++)
	for (size_t j = 0; j < 3; j++)
		indices[i * 3 + j] = mTriIndices[i * 3 + j];

// fill vertices and normals (braces needed so both statements are inside the inner loop)
for (size_t i = 0; i < mNumVertices; i++)
{
	for (size_t j = 0; j < 3; j++)
	{
		vertices[i * 3 + j] = mVertices[i * 3 + j];
		normals[i * 3 + j]  = mNormals[i * 3 + j];
	}
}

// fill texcoords
for (size_t i = 0; i < mNumVertices; i++)
	for (size_t j = 0; j < 2; j++)
		texcoords[i * 2 + j] = mTexCoords[i * 2 + j];

// unmap buffers
mOptix.mIndices->unmap();
mOptix.mVertices->unmap();
mOptix.mNormals->unmap();
mOptix.mTexCoords->unmap();

However, a performance degradation occurs when I replace the code above and instead create the OptiX buffers and the OptiX geometry from existing OpenGL buffers:

mGl.mVerticesBuffer = make_shared<OpenglBuffer>();
mGl.mNormalsBuffer = make_shared<OpenglBuffer>();
mGl.mTexCoordsBuffer = make_shared<OpenglBuffer>();
mGl.mIndicesBuffer = make_shared<OpenglBuffer>();

glNamedBufferData(mGl.mVerticesBuffer->mHandle, sizeof(float) * mNumVertices * 3, &(mVertices[0]), GL_STATIC_DRAW);
glNamedBufferData(mGl.mNormalsBuffer->mHandle, sizeof(float) * mNumVertices * 3, &(mNormals[0]), GL_STATIC_DRAW);
glNamedBufferData(mGl.mTexCoordsBuffer->mHandle, sizeof(float) * mNumVertices * 2, &(mTexCoords[0]), GL_STATIC_DRAW);
glNamedBufferData(mGl.mIndicesBuffer->mHandle, sizeof(uint32_t) * mNumTriangles * 3, &(mTriIndices[0]), GL_STATIC_DRAW);

// create OptiX buffers from the OpenGL buffer objects
// (format and size still need to be set on each buffer afterwards via setFormat()/setSize())
mOptix.mVertices  = context->createBufferFromGLBO(RT_BUFFER_INPUT, mGl.mVerticesBuffer->mHandle);
mOptix.mNormals   = context->createBufferFromGLBO(RT_BUFFER_INPUT, mGl.mNormalsBuffer->mHandle);
mOptix.mTexCoords = context->createBufferFromGLBO(RT_BUFFER_INPUT, mGl.mTexCoordsBuffer->mHandle);
mOptix.mIndices   = context->createBufferFromGLBO(RT_BUFFER_INPUT, mGl.mIndicesBuffer->mHandle);

It seems to be caused by numerous calls to cuGraphicsUnmapResources.
Is there a way to fix or avoid this problem?

What’s your system configuration?
OS version, installed GPU(s), display driver version, OptiX version (major.minor.micro), CUDA version.

To be able to confirm benchmarks, I would need absolute performance numbers and a description of how they were generated.

If you found that this is due to the CUDA unregister and register or unmap calls, how often are you changing these vertex attributes?

A way to reduce that number of calls would be to put all the vertex attributes into one buffer if you exchange them all together anyway. I normally try to keep the number of OptiX buffers minimal.
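As an illustration of that layout (plain C++, independent of the OptiX API; the struct and function names here are my own, not from the original code), the three per-vertex arrays could be packed into a single interleaved array, which can then back one buffer object instead of three:

```cpp
#include <cstddef>
#include <vector>

// Interleaved layout: one struct per vertex instead of three separate buffers.
struct VertexAttrib
{
    float position[3];
    float normal[3];
    float texcoord[2];
};

// Pack the separate arrays into a single interleaved array; the result can
// back one GL buffer / one OptiX buffer, so only one graphics resource
// needs to be registered and mapped/unmapped.
std::vector<VertexAttrib> interleave(const float* vertices,
                                     const float* normals,
                                     const float* texcoords,
                                     std::size_t numVertices)
{
    std::vector<VertexAttrib> packed(numVertices);
    for (std::size_t i = 0; i < numVertices; ++i)
    {
        for (std::size_t j = 0; j < 3; ++j)
        {
            packed[i].position[j] = vertices[i * 3 + j];
            packed[i].normal[j]   = normals[i * 3 + j];
        }
        for (std::size_t j = 0; j < 2; ++j)
            packed[i].texcoord[j] = texcoords[i * 2 + j];
    }
    return packed;
}
```

The device-side programs would then read the attributes through a matching struct declaration instead of three separate buffer variables.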

If many of these buffers exist for different geometries, the rtGeometrySetPrimitiveIndexOffset() function allows combining the vertex attributes of multiple geometries into a single buffer.
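A small sketch of the bookkeeping that involves (plain C++; the helper function is my own, and the OptiX call is only indicated in a comment): when concatenating several triangle lists into one buffer, each geometry needs the index of its first primitive inside the shared buffer.

```cpp
#include <cstddef>
#include <vector>

// Given the triangle count of each geometry, compute the offset of each
// geometry's first primitive inside a shared, concatenated index buffer.
// Each offset would then be passed to the corresponding geometry, e.g.
//   geometry->setPrimitiveIndexOffset(offsets[i]);
// (the C API equivalent is rtGeometrySetPrimitiveIndexOffset).
std::vector<std::size_t> primitiveOffsets(const std::vector<std::size_t>& triCounts)
{
    std::vector<std::size_t> offsets(triCounts.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < triCounts.size(); ++i)
    {
        offsets[i] = running;    // first primitive index of geometry i
        running += triCounts[i]; // advance past this geometry's triangles
    }
    return offsets;
}
```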

If this is the actual geometry of the scene, then the acceleration structures would also need to be rebuilt or refit every time (which would also happen when going through host buffers).

(In the host version code block, I would recommend using memcpy() to copy the four arrays if they are tightly packed.)
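For tightly packed arrays, the element-wise copy loops reduce to one memcpy() per buffer. A sketch (the function name and parameter list are mine; the pointers stand in for the mapped OptiX buffer pointers and the host source arrays from the snippet above):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy tightly packed host arrays into the mapped OptiX buffers with one
// memcpy per buffer instead of nested element loops.
void fillBuffers(int32_t* indices, float* vertices, float* normals, float* texcoords,
                 const int32_t* srcIndices, const float* srcVertices,
                 const float* srcNormals, const float* srcTexcoords,
                 std::size_t numTriangles, std::size_t numVertices)
{
    std::memcpy(indices,   srcIndices,   sizeof(int32_t) * numTriangles * 3);
    std::memcpy(vertices,  srcVertices,  sizeof(float)   * numVertices * 3);
    std::memcpy(normals,   srcNormals,   sizeof(float)   * numVertices * 3);
    std::memcpy(texcoords, srcTexcoords, sizeof(float)   * numVertices * 2);
}
```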

Thank you for your answer.

Here is my system configuration.

OS:                    Windows 10 Home 64-bit
GPU:                   NVIDIA GeForce GTX 1070
NVIDIA Driver Version: 385.54
OptiX Version:         4.1.1
CUDA Version:          9.0.176

I used Nsight to measure the performance and to view the CUDA Driver API Call Summary.
(Additional note: regardless of how I create the OptiX buffers, I always see an unusually high number of cuMemcpyDtoHAsync_v2 calls.)

While I have a lot of buffers, they are all for static geometry inside the scene. I create the OptiX buffers (from the OpenGL buffers) only once during setup. In my rendering loop, there is only a single function that calls OptiX commands.

My program uploads data from CPU to GPU only once, during setup.

optix::GeometryGroup mOptixTopGeometryGroup;

void renderPathTrace()
{
    mOptixContext->launch(0, mResolution.x, mResolution.y);
}

In every frame I call the function above, and I use the OpenGL PBO approach to display the data inside mOptixGlTbo.

Is there some way to avoid rebuilding acceleration structures in every frame?

One more thing I would like to know: which OptiX configurations or parameters could possibly lead to large numbers of cuGraphics[Map/Unmap]Resources and cuMemcpyDtoHAsync_v2 calls, so that I can avoid them?

Generating PTX code for OptiX 4.1.1 with CUDA 9.0 is not officially supported.
If the code doesn’t change from what CUDA 8.0 would have generated, you’re fine though.
If things start to fail because of PTX errors, try downgrading to CUDA 8.0.

If you only upload that data once there won’t be AS rebuilds or refits per frame.

I would not expect a large number of map/unmap calls for graphics resources if you're using these buffers in OptiX without marking them dirty every frame.
I don't actually know whether using the interop buffers in OpenGL at the same time would require these calls. That simply sounds wrong.

Maybe it's just this:
Do not set these per frame if they don't change! That should happen once, during scene setup.


Minimize the number of OptiX calls inside your render loop. There should only be the launch() and some variables set when actually needed, e.g. the iteration counter inside a progressive renderer or the camera parameters only if they changed. Nothing else.
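One simple way to enforce that discipline is a dirty-flag wrapper: cache the last value uploaded and only touch the context variable when it actually changed. A plain C++ sketch (the class and the upload callback are my own illustration; the callback stands in for the actual OptiX variable update):

```cpp
#include <functional>
#include <utility>

// Dirty-flag wrapper: forwards a value to an upload callback only when it
// changes, so per-frame variable updates happen only when actually needed.
template <typename T>
class TrackedValue
{
public:
    explicit TrackedValue(std::function<void(const T&)> upload)
        : mUpload(std::move(upload)) {}

    // Returns true if an upload actually happened on this call.
    bool set(const T& value)
    {
        if (mHasValue && value == mCached)
            return false; // unchanged: skip the OptiX call entirely
        mCached = value;
        mHasValue = true;
        mUpload(value);   // e.g. context["iteration"]->setInt(value);
        return true;
    }

private:
    std::function<void(const T&)> mUpload;
    T mCached{};
    bool mHasValue = false;
};
```

With wrappers like this for the camera parameters and the iteration counter, the render loop body shrinks to the variable checks plus the single launch() call.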

How does it perform when not running under Nsight?

Could you please quantify the performance change and the number of unexpected calls in absolute numbers?