Host-device transfer bottleneck

hi

I am testing a collider using OptiX and noticed a huge bottleneck caused by
several under-the-hood deallocations/allocations and host-device transfers…
This happens every time I launch the tracer, and OptiX seems to be switching contexts.

Here,

it is pointed out that the OptiX and CUDA contexts cannot share memory on the
actual GPU hardware, which leads to the copying.
The author mentions a workaround using an OpenGL buffer, but is this the only way?
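
(For context, the OpenGL-buffer workaround I am referring to would look roughly like this; a sketch assuming an existing GL buffer object vboId, a current GL context, and illustrative names:)

// Create the OptiX buffer from an existing OpenGL buffer object so that
// both APIs reference the same GPU memory.
optix::Buffer glShared = context->createBufferFromGLBO(RT_BUFFER_INPUT_OUTPUT, vboId);
glShared->setFormat(RT_FORMAT_FLOAT);
glShared->setSize(size);
context["glSharedBuffer"]->setBuffer(glShared);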

This is how I am creating the buffers:

optix::Context context;
optix::Buffer buffer1, buffer2, buffer3;

void init()
{
    context = optix::Context::create();
    //...
    //...
    // Input buffer; its device pointer is supplied later in update().
    buffer1 = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_UNSIGNED_INT);
    context["buffer1"]->setBuffer(buffer1);

    // Output buffer, read back through result().
    buffer2 = context->createBuffer(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT4);
    context["buffer2"]->setBuffer(buffer2);
    buffer2->setSize(size2);

    // Input/output buffer, written from CUDA in update().
    buffer3 = context->createBuffer(RT_BUFFER_INPUT_OUTPUT, RT_FORMAT_FLOAT);
    context["buffer3"]->setBuffer(buffer3);
    buffer3->setSize(size3);
}

//and this is how I get/set the data:

void update(unsigned int *usrData1, float *userData3, unsigned int size3)
{
    // Point buffer1 at memory already owned by the CUDA side of the application.
    buffer1->setDevicePointer(0, usrData1);

    // Copy new data for buffer3 directly on the device.
    cudaMemcpyAsync(buffer3->getDevicePointer(0), userData3,
                    sizeof(float) * size3, cudaMemcpyDeviceToDevice);
}

float *result()
{
    return (float*)buffer2->getDevicePointer(0);
}

thank you.

Hi Burajimiru,

This behavior is intentional by default; see section "9.2.1 Buffer synchronization" of the OptiX documentation for how it works.

Basically, OptiX aims to make multi-GPU setups easy to use, so it manages memory between device and host for us. It can also behave differently if you add certain flags in your code.

If you do not want the buffer to be synchronized every time, use RT_BUFFER_COPY_ON_DIRTY.
If you want OptiX to share buffers with CUDA, use RT_BUFFER_GPU_LOCAL.
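
For example, buffers could be created with these flags like this (a minimal sketch using the same OptiX C++ wrapper as your snippets; variable names and size are illustrative):

// GPU-local buffer: kept per GPU and never copied back to the host.
optix::Buffer gpuLocal = context->createBuffer(
    RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL, RT_FORMAT_FLOAT, size);
context["gpuLocalBuffer"]->setBuffer(gpuLocal);

// Copy-on-dirty buffer: only synchronized when markDirty() is called explicitly.
optix::Buffer onDirty = context->createBuffer(
    RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_COPY_ON_DIRTY, RT_FORMAT_FLOAT, size);
context["onDirtyBuffer"]->setBuffer(onDirty);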

There are some caveats to be aware of, due to PC architecture, especially PCIe…

HIH,

Yashiz

Yes, I am not sure why the blog post claims that you cannot share memory between OptiX and CUDA; after all, OptiX is built on CUDA. The CUDA interop API (e.g., rtBufferSetDevicePointer) does allow you to share on-GPU memory between OptiX and an external CUDA application. You can see this in action in the optixOcean sample from the OptiX advanced samples here:

This sample shares memory between a simple OptiX-based renderer and CUDA's cuFFT library.
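
Roughly, the interop pattern looks like this (a sketch, not taken from the sample itself, and assuming the same context setup as the snippets above; names are illustrative):

// Memory allocated and filled by the external CUDA side of the application.
float* d_data = nullptr;
cudaMalloc(&d_data, size * sizeof(float));

// Hand that device pointer to an OptiX buffer on device 0; OptiX then uses it
// in place instead of allocating its own storage and copying into it.
optix::Buffer shared = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT, size);
shared->setDevicePointer(0, d_data);
context["sharedBuffer"]->setBuffer(shared);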

Thank you for your replies,
They helped me to realize a semantic mistake I was making.

I am using only one GPU in one machine, and all my data resides on the device. What I want is to read/write on the device only; there is absolutely no need for the host in my application (except during start-up).

There are 4x3 + 1 buffers: particleStart(x,y,z), particleEnd(x,y,z), colliderStart(x,y,z), colliderEnd(x,y,z) and particleRadius.
I have changed all of them to OUTPUT because there is no need to synchronize them with the host.
After copying the data to the buffers, I launch the tracer and then process the result somewhere else.

here is the code being used now:

buffer = context->createBuffer(RT_BUFFER_OUTPUT | RT_BUFFER_COPY_ON_DIRTY, RT_FORMAT_FLOAT);
context["buffer"]->setBuffer(buffer);
buffer->setSize(size);
buffer_data = (float*)buffer->getDevicePointer(0);

(and I never mark the buffers as dirty, so there should not be automatic synchronization)

and on write:

cudaMemcpyAsync(buffer_data, myData, sizeof(float)*size, cudaMemcpyDeviceToDevice);

If I am not mistaken, this should be enough to avoid host-device transfers, but this is the kind of result I am getting:

[url]https://ibb.co/bF7i1H[/url]

The purple blocks are kernels in my application.
The light blue bars are the D2D copies that set the buffers.
Then the GPU goes idle when cuMemFree_v2 is called.
The first gap is followed by some H2D copies, memsets, and the refit.
The second gap is similar, followed by the MegaKernel.
The last gap is where the results are copied back to the host before returning to the application.

I believe cuMemFree_v2 is being called because of the refit kernel. When I don't mark the acceleration structure as dirty, this step is skipped, but the MegaKernel still makes H2D transfers.
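
(To be clear about what I mean by marking it dirty, the refit setup I am describing looks roughly like this; the handle names are illustrative:)

// Trbvh builder with refitting enabled; marking the acceleration dirty after
// updating the geometry triggers the refit pass on the next launch.
optix::Acceleration accel = context->createAcceleration("Trbvh");
accel->setProperty("refit", "1");
// ...attach accel to the geometry group, then each frame:
accel->markDirty();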

So, am I making any other mistake here?
Are those copies the internal data of the tree structure being moved around, or just the buffers?
If OptiX is copying all the acceleration structure data back and forth, is there a way to make it reside permanently on the GPU?

Section 7.2.3 of the documentation mentions that calling rtBufferGetDevicePointer may return a pointer to zero-copy memory, but that is valid only for multi-GPU contexts, right?
If not, how can I get device pointers without calling rtBufferGetDevicePointer?

If you need more details, let me know; I can post everything here later.

thanks