Hey @jagj,
Your understanding sounds correct: you should only need to update the transform matrices and then do an UPDATE operation on your IAS.
When you say this operation is taking 1 second, do you mean the host-side wall-clock timing of your updateAccel() function?
If you copy all of your transforms, the total data that needs to be transferred to the GPU is 12 * 4 * 57000 = 2,736,000 bytes (12 floats per 3x4 transform, 4 bytes per float, 57,000 instances), so roughly 2.7 MB. That should only take a small fraction of a second, and I expect the optixAccelBuild() call to similarly take a very small amount of time.
I don’t see in your code snippet where the data transfer to the GPU is occurring. Are you copying each matrix separately?
The first thing I would recommend is running your application through Nsight Systems. That way you will be able to see how much time goes to each portion of the operation: host functions, data transfer, and optixAccelBuild(). I'm guessing that almost all of your 1 second is occupied by the scene.getMeshes() loop, and that optixAccelBuild() is only a tiny blip at the end.
From the code snippet, there are a couple of things I would suspect are taking nearly all of your time: geometryInstances[name] is presumably calling a string hashing function or doing a map lookup? And TransformToOptixMax() looks like it might be doing host-side memory allocation (is it constructing a new std::vector? is it doing a matrix multiply? is it copying data?). Any of those things will take time, but if it's allocating memory, that is likely to be the primary culprit.
So if there are 57k meshes in the scene, keep in mind that the loop body makes at least 4 separate function calls, meaning over 200k host-side function calls per update, and it looks like some of them are non-inline and non-trivial. I also can't tell whether scene.getMeshes() itself might be expensive.
There are some alternatives to handling your scene updates this way. One of them would be to have your simulation/update code write directly into your instance transform array. Then, when it comes time to update the IAS, your updateAccel() function would consist of only one cudaMemcpy() and one optixAccelBuild().
A different alternative is to track which instances were updated and which were untouched. Then you can coalesce only the dirty transforms into contiguous blocks, each of which is copied to the GPU separately, and your loop over the meshes can immediately skip the non-dirty ones. Coalescing consecutive dirty transforms into blocks adds some complexity, and whether it pays off depends on how many dirty transforms you typically need to handle per frame, and how often your dirty transforms are side-by-side in memory. If only a very small number of transforms are dirty in any given frame, you could simply copy each dirty transform individually, i.e. one cudaMemcpy() per dirty transform.
First order of business for you, I think, is to do some profiling and get a better picture of what the timing is for all the sub-components of your function.
–
David.