Optimize Host->GPU Transfer Times?

I’m trying to optimize an application that already uses OptiX to determine which facets of complex 3-D models are visible. The code creates about 400,000 rays, then executes the OptiX query, which takes about 0.5 seconds on a Tesla K40c under Ubuntu Linux.

Does it make sense to look at the host->GPU transfer time for the data structure that contains the 400,000 rays? My guess is that it can’t be improved, but a smart person told me to look into this. Currently, the transfer happens automatically within the API call that executes the OptiX query, so I don’t even know whether it’s possible to optimize the data transfer.

I’m basically clueless about how the transfer works, so if it does not make sense to look into this, feel free to tell me.

Thanks for any advice in advance.

NOTE: I should have mentioned in the original post that I’m using OptiX Prime. I only need to cast rays.

How big is your buffer of rays in bytes? (And how big is your buffer of hit results?)

Assuming you’re using PCIe 3.0, you should be getting at least 10 GB/s transfer rates in practice. For an uncompressed buffer of rays, each made up of a float3 origin and a float3 direction (in other words, 24 bytes per ray), 400,000 rays come to about 9.6 MB, so you should expect the buffer transfer to take around 1 millisecond.
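
To make the napkin math easy to redo with your own numbers, here is a minimal sketch of the estimate. The function names are made up for illustration, and the 24-byte ray layout and 10 GB/s sustained rate are the assumptions from above, not measurements.

```cpp
#include <cstddef>

// Assumed ray layout: float3 origin + float3 direction = 6 floats = 24 bytes.
constexpr std::size_t kBytesPerRay = 6 * sizeof(float);

// Total size of the ray buffer in bytes.
constexpr std::size_t ray_buffer_bytes(std::size_t num_rays) {
    return num_rays * kBytesPerRay;
}

// Estimated copy time in milliseconds at a sustained bandwidth given in
// decimal GB/s (1 GB = 1e9 bytes). Ignores per-copy latency and driver overhead.
constexpr double transfer_ms(std::size_t bytes, double bandwidth_gb_per_s) {
    return static_cast<double>(bytes) / (bandwidth_gb_per_s * 1e9) * 1e3;
}
```

For example, `transfer_ms(ray_buffer_bytes(400000), 10.0)` comes out just under 1 ms, which is tiny next to a 0.5 second query. If your rays also carry tmin/tmax (32 bytes each), the estimate only grows by a third.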

If my napkin math is anywhere close to your situation, it means your trace time is what’s dominating. How complex is your scene? Are you using transforms & instancing?

A few options to consider:

If you still suspect your transfer times, you could verify it by doing the transfer manually using cudaMemcpy(), and time the transfer and the trace separately.

If generating your rays on the GPU is an option, you might be able to eliminate the host->GPU transfer by writing a ray generation program in CUDA.

The main way to reduce transfer times is to reduce the size of the buffer, which means either removing anything you don’t need, or compressing your buffer and having a program on the GPU for decompression.
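
For the first option above, here is a rough sketch of timing just the host->GPU copy with the CUDA runtime API and CUDA events. It assumes the 24-byte ray layout and uses a dummy zero-filled buffer in place of your real rays; `time_h2d_copy_ms` is a helper name I made up. It needs the CUDA toolkit to build (for example with nvcc).

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Times one host->device copy of `bytes` bytes.
// Returns elapsed milliseconds, or a negative value if a CUDA call failed.
float time_h2d_copy_ms(const void* host_src, std::size_t bytes) {
    void* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(dev, host_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}

int main() {
    const std::size_t num_rays = 400000;                 // from the original post
    const std::size_t floats_per_ray = 6;                // assumed: origin + direction
    std::vector<float> rays(num_rays * floats_per_ray, 0.0f);  // placeholder data
    const std::size_t bytes = rays.size() * sizeof(float);

    time_h2d_copy_ms(rays.data(), bytes);                // warm-up: first call pays CUDA init cost
    const float ms = time_h2d_copy_ms(rays.data(), bytes);
    if (ms < 0.0f) {
        std::printf("CUDA error during copy\n");
        return 1;
    }
    std::printf("copied %zu bytes in %.3f ms (%.2f GB/s)\n",
                bytes, ms, (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```

If the number that comes back is in the low milliseconds, the copy is not where your 0.5 seconds is going.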


I don’t know exactly how big the rays are in bytes, but I think you are probably right, or within a factor of 2. 400,000 rays at 24 bytes is about 10 MB. A 10 GB/s transfer rate seems high to me, but I don’t know. Is that what people achieve in practice?

Do we know that OptiX transfers all the rays at once? Alternatively, it might transfer them on the fly in smaller batches. If so, the transfer time would be a lot higher.

If I were to generate the rays on the GPU, how would I pass them to the OptiX Prime query function? As far as I can tell, there is no hook in the API for the query function to take the ray structure as input when it is already on the GPU. Can that be done with OptiX Prime?

Thanks for your reply.

With Prime, you specify whether your buffer is located on the host or device when you create the buffer.
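
As a rough sketch of what that looks like with the Prime C API: the buffer type passed to `rtpBufferDescCreate` says where the memory lives. The helper name `set_device_rays` is made up, the 24-byte origin/direction format is an assumption about your rays, and the include path may differ in your SDK install. I'm also assuming the context was created as a CUDA context and that `d_rays` is a device pointer you allocated and filled yourself.

```cpp
#include <optix_prime/optix_prime.h>

// Describes a ray buffer that already lives in GPU memory and attaches it
// to an existing query. Returns the first failing RTPresult, or RTP_SUCCESS.
RTPresult set_device_rays(RTPcontext context, RTPquery query,
                          void* d_rays, RTPsize num_rays,
                          RTPbufferdesc* out_desc) {
    // RTP_BUFFER_TYPE_CUDA_LINEAR marks the pointer as device memory;
    // RTP_BUFFER_TYPE_HOST would mark it as host memory instead.
    RTPresult r = rtpBufferDescCreate(context,
                                      RTP_BUFFER_FORMAT_RAY_ORIGIN_DIRECTION,
                                      RTP_BUFFER_TYPE_CUDA_LINEAR,
                                      d_rays, out_desc);
    if (r != RTP_SUCCESS) return r;

    // Use elements [0, num_rays) of the buffer.
    r = rtpBufferDescSetRange(*out_desc, 0, num_rays);
    if (r != RTP_SUCCESS) return r;

    return rtpQuerySetRays(query, *out_desc);
}
```

The hit buffer can be described the same way, so the results can stay on the GPU too if whatever consumes them runs there.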

As for transfer rates, the peak is determined by your PCIe generation and how many lanes you have. I’m assuming you have 16 lanes of PCIe 3.0, which gives a theoretical peak of about 15.8 GB/s, and in practice you should get above 10 GB/s pretty easily. https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions
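
In case it helps to see where that peak figure comes from, here is the arithmetic as a small sketch (the function name is just for illustration). PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, and the result is per direction, before any protocol overhead.

```cpp
// Theoretical one-direction PCIe 3.0 bandwidth in decimal GB/s for a given
// lane count: 8e9 transfers/s per lane, 128 payload bits per 130 line bits,
// 8 bits per byte.
constexpr double pcie3_peak_gb_per_s(int lanes) {
    return lanes * 8.0e9 * (128.0 / 130.0) / 8.0 / 1.0e9;
}
```

With 16 lanes this gives roughly 15.75 GB/s, and half that if the card ends up in an x8 slot, which is worth checking if a measured copy looks slow.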

The OptiX Prime buffer copy does happen in a single copy; it is not split into batches.


By the way, you could attempt to use a page-locked buffer and see if that helps.
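
A minimal sketch of the page-locked idea using the plain CUDA runtime, assuming the same 24-byte ray layout and placeholder data: allocate the host-side ray buffer with `cudaMallocHost` instead of `malloc`/`new`, then copy as usual. Whether this makes a measurable difference for your Prime query is something you'd have to time.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t num_rays = 400000;              // from the original post
    const std::size_t num_floats = num_rays * 6;      // assumed: origin + direction
    const std::size_t bytes = num_floats * sizeof(float);

    // Page-locked (pinned) host allocation: the GPU can DMA from it directly,
    // so the driver does not need to stage the data through its own buffer.
    float* pinned_rays = nullptr;
    if (cudaMallocHost(reinterpret_cast<void**>(&pinned_rays), bytes) != cudaSuccess) {
        std::printf("cudaMallocHost failed\n");
        return 1;
    }
    for (std::size_t i = 0; i < num_floats; ++i) pinned_rays[i] = 0.0f;  // placeholder rays

    void* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(pinned_rays);
        return 1;
    }

    const cudaError_t err = cudaMemcpy(dev, pinned_rays, bytes, cudaMemcpyHostToDevice);
    std::printf("copy from pinned memory: %s\n", cudaGetErrorString(err));

    cudaFree(dev);
    cudaFreeHost(pinned_rays);   // pinned memory must be released with cudaFreeHost
    return err == cudaSuccess ? 0 : 1;
}
```

If the rays already live in a buffer you can't reallocate, `cudaHostRegister` can pin existing memory instead. I believe the Prime header also offers `rtpHostBufferLock` for the same purpose, but check your SDK version before relying on it.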

I suspect that transfer times aren’t your issue, but let me know what you find.