I wrote a program to test the performance of optixAccelBuild under various input conditions. Each run consists of generating a buffer of 2^26 triangles and then starting five builds in sequence, each followed by an OptiX pipeline invocation. The output matches what I expect, so the code itself is very likely correct.
However, the first build always takes significantly longer than the remaining four builds, and this "one long, four short" pattern repeats on every run. It does not occur, however, when I test with fewer than 2^26 triangles.
I’m aware that the first CUDA API call of a program triggers initialization, and at first I suspected this to be the cause of the issue. However, the code in question contains neither the first CUDA call nor the first OptiX call of the program.
Can you explain this weird behavior? I attached a profiling screenshot below.
Interesting question. I don’t have a theory yet, but can you tell me which OS this is? Is this the display GPU?
When you say it does not occur with fewer than 2^26 triangles, can you elaborate? Does the long build suddenly become short once the buffer drops below some size, or does it shrink in proportion to the buffer size? If it’s a threshold, roughly where is it? How big is your buffer of 2^26 triangles? How much free memory do you have when you start the build? Have you tried running with CUDA_LAUNCH_BLOCKING=1? That might clarify what the first BVH build is waiting for.
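By the way, if you don’t already log it, cudaMemGetInfo is a quick way to check free device memory right before each build. A minimal sketch (the helper name is just a placeholder):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Log free/total device memory, e.g. right before each optixAccelBuild call.
void logDeviceMemory(const char* label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("%s: %.2f GB free of %.2f GB\n",
                label, freeBytes / 1.0e9, totalBytes / 1.0e9);
}
```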
It might be a clue that the optixAccelBuild call itself shows up as taking a long time. Like most OptiX API calls, this function is asynchronous: it only initiates the CUDA work and usually returns almost immediately, unless it has to block for some reason. This may indicate that whatever you did before calling the build is still processing, and that the build has to wait synchronously for the preceding workload to complete. Another possibility is that, if you’re on Windows, WDDM is holding up your memory requests for some reason; this can happen on display GPUs, especially in low-memory situations.
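One way to narrow this down is to time the host-side duration of the optixAccelBuild call separately from the GPU time of the kernels it launches. Here is a rough, untested sketch of what I mean; it assumes the context, stream, build input, and buffers are set up the same way as in your program, and the parameter names are placeholders:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <optix.h>

// Measure how long the optixAccelBuild call blocks on the host versus how
// long the GPU spends on the work it enqueues on the given stream.
void timedAccelBuild(OptixDeviceContext context, CUstream stream,
                     const OptixAccelBuildOptions* options,
                     const OptixBuildInput* input,
                     CUdeviceptr temp, size_t tempSize,
                     CUdeviceptr output, size_t outputSize,
                     OptixTraversableHandle* handle)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    const auto hostBegin = std::chrono::steady_clock::now();

    optixAccelBuild(context, stream, options, input, 1,
                    temp, tempSize, output, outputSize, handle, nullptr, 0);

    const auto hostEnd = std::chrono::steady_clock::now();  // host-side call returns here
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);                             // wait for the enqueued GPU work

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    const double hostMs =
        std::chrono::duration<double, std::milli>(hostEnd - hostBegin).count();
    std::printf("optixAccelBuild: host %.3f ms, GPU %.3f ms\n", hostMs, gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

If the host-side number is large while the GPU time is short, the call itself is blocking on something; if the GPU time dominates, the time really goes into the build kernels (or into older work on the stream that they have to wait for).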
This is a screenshot from the first of five runs with 2^25 triangles. It seems as if optixAccelBuild launches some kernels and returns immediately afterwards. The call itself only takes a few µs, while the kernels take 90 ms in total. Also, there is no elongated host-only section in this case.
I’m not sure if there is a hard threshold somewhere between 2^25 and 2^26, but it’s probably not 2^26 itself. I tested with (2^26)-100 triangles, and the issue still occurred.
I have another codebase (let’s call it codebase B) which runs the same experiment under almost identical conditions, and the issue does not appear there: every run takes just 160 ms, including the first one. The only difference is that codebase B runs optixAccelBuild in a stream other than the default stream. Therefore, I tried launching the build from a newly-created stream in codebase A as well, but the issue still persisted.
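For reference, the change in codebase A amounted to roughly the following (simplified sketch, error checking omitted; the variable names are placeholders for what the code actually uses):

```cpp
// Build on a newly-created stream instead of the default stream.
cudaStream_t buildStream = nullptr;
cudaStreamCreate(&buildStream);

cudaDeviceSynchronize();  // nothing older should still be pending

optixAccelBuild(context, buildStream, &accelOptions, &buildInput, 1,
                tempBuffer, tempBufferSize,
                outputBuffer, outputBufferSize,
                &accelHandle, nullptr, 0);

cudaStreamSynchronize(buildStream);
```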
I had a similar idea regarding the synchronous wait, but I’m running cudaDeviceSynchronize() before starting the build, which should wait for all pending tasks (afaik). The host-only block looks somewhat like a cached JIT compilation to me, but I don’t recall OptiX using JIT compilation for acceleration structure builds.
I’ll run some more experiments with CUDA_LAUNCH_BLOCKING=1 and memory profiling, and post my findings to this thread.
I just had a weird run where I was calling cudaStreamCreate before optixAccelBuild, and this time cudaStreamCreate took a very long time, while optixAccelBuild behaved as expected.
However, the old problem returned in the next batch of five runs. Overall, it seems as if the issue might not be related to OptiX at all. Ideally, there would be a way to force this host-only workload to happen at a point of my choosing, so that my benchmarks remain unaffected.
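Just to illustrate what I mean by forcing the workload early: something like a throwaway warm-up build before the timed ones might do it, so that any one-time host-side work happens outside the benchmark. An untested sketch (it assumes the OptiX context is already initialized elsewhere, the names are placeholders, and since the problem only shows up around 2^26 triangles, a tiny warm-up like this might not be enough — an extra full-size untimed build could be the safer bet):

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>

// Throwaway build of a single dummy triangle, run once before the benchmark.
void warmUpAccelBuild(OptixDeviceContext context, CUstream stream)
{
    const float3 verts[3] = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };
    CUdeviceptr dVerts = 0;
    cudaMalloc(reinterpret_cast<void**>(&dVerts), sizeof(verts));
    cudaMemcpy(reinterpret_cast<void*>(dVerts), verts, sizeof(verts),
               cudaMemcpyHostToDevice);

    OptixBuildInput input = {};
    input.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    input.triangleArray.vertexFormat  = OPTIX_VERTEX_FORMAT_FLOAT3;
    input.triangleArray.numVertices   = 3;
    input.triangleArray.vertexBuffers = &dVerts;
    const unsigned int flags[1] = { OPTIX_GEOMETRY_FLAG_NONE };
    input.triangleArray.flags         = flags;
    input.triangleArray.numSbtRecords = 1;

    OptixAccelBuildOptions options = {};
    options.buildFlags = OPTIX_BUILD_FLAG_NONE;
    options.operation  = OPTIX_BUILD_OPERATION_BUILD;

    OptixAccelBufferSizes sizes = {};
    optixAccelComputeMemoryUsage(context, &options, &input, 1, &sizes);

    CUdeviceptr dTemp = 0, dOutput = 0;
    cudaMalloc(reinterpret_cast<void**>(&dTemp), sizes.tempSizeInBytes);
    cudaMalloc(reinterpret_cast<void**>(&dOutput), sizes.outputSizeInBytes);

    OptixTraversableHandle handle = 0;
    optixAccelBuild(context, stream, &options, &input, 1,
                    dTemp, sizes.tempSizeInBytes,
                    dOutput, sizes.outputSizeInBytes,
                    &handle, nullptr, 0);
    cudaStreamSynchronize(stream);

    cudaFree(reinterpret_cast<void*>(dTemp));
    cudaFree(reinterpret_cast<void*>(dOutput));
    cudaFree(reinterpret_cast<void*>(dVerts));
}
```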
Regarding memory size, 2^26 triangles occupy around 2.4 GB, and there are 42 GB of free memory when the build starts. I’m looking forward to your response.
Thanks for the extra details. Okay, so it sounds like the issue is possibly related to streams. I guess the very next thing to try is to synchronize the device both before and after creating the stream; maybe that will help. Part of what I’m wondering is whether there’s work on the other stream that is blocking the new stream from either finishing initialization or from starting new work. This can happen if you have memory allocation operations in progress, for example cudaMalloc or cudaFree. Certain operations can cause all streams to behave synchronously.
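Roughly what I have in mind, as an untested sketch (the sizes and the loop body are placeholders for whatever your program actually does):

```cpp
#include <cuda_runtime.h>

// Bracket the stream creation with device syncs, and do all allocations once
// up front, so that no cudaMalloc/cudaFree can force implicit synchronization
// during the timed builds.
void runBuilds(size_t tempSize, size_t outputSize)
{
    cudaDeviceSynchronize();              // drain anything already in flight

    cudaStream_t buildStream = nullptr;
    cudaStreamCreate(&buildStream);

    cudaDeviceSynchronize();              // isolate the stream creation itself

    void* tempBuffer = nullptr;
    void* outputBuffer = nullptr;
    cudaMalloc(&tempBuffer, tempSize);    // allocate once, outside the loop
    cudaMalloc(&outputBuffer, outputSize);
    cudaDeviceSynchronize();

    for (int i = 0; i < 5; ++i)
    {
        // ... optixAccelBuild(..., buildStream, ...) and the pipeline launch ...
        cudaStreamSynchronize(buildStream);
    }

    cudaFree(tempBuffer);
    cudaFree(outputBuffer);
    cudaStreamDestroy(buildStream);
}
```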
> I’m running cudaDeviceSynchronize() before starting the build, which should wait for all pending tasks (afaik).
That’s right, and setting CUDA_LAUNCH_BLOCKING=1 gives you essentially the same effect automatically: every kernel launch blocks until the kernel has finished. The benefits of the environment variable are that you don’t have to write or rebuild any code, and it applies to all kernels. It could help in this case if there is some cascade of blocking behavior all waiting on an operation in a different stream that you weren’t suspecting. Forcing all work to be synchronous will give you a clearer picture of which things are actually taking a long time.
Hi, sorry for the late reply! I tried everything you suggested and then some, but I was still unable to find a reason for this weird behavior. However, I have come to the conclusion that this is a CUDA issue, not an OptiX issue. At some point in the future, I’ll probably reopen a similar thread in the general CUDA forums. Thanks for your help.