Understanding OptiX internal memory use

Hi, as mentioned here: Is there a way to know how much GPU memory Optix will use?,

you can see exactly how much memory OptiX will use for acceleration structures, because you allocate the device memory for them yourself based on the optixAccelComputeMemoryUsage results.
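
For reference, this is the pattern I mean, as a minimal sketch; context, accelOptions, buildInput, stream, and handle are assumed to be set up as in the SDK samples, and error checking is omitted:

OptixAccelBufferSizes sizes = {};
optixAccelComputeMemoryUsage( context, &accelOptions, &buildInput, 1, &sizes );

// These explicit allocations are exactly what the application can account for.
CUdeviceptr d_temp = 0, d_output = 0;
cudaMalloc( reinterpret_cast<void**>( &d_temp ), sizes.tempSizeInBytes );
cudaMalloc( reinterpret_cast<void**>( &d_output ), sizes.outputSizeInBytes );

optixAccelBuild( context, stream, &accelOptions, &buildInput, 1,
                 d_temp, sizes.tempSizeInBytes,
                 d_output, sizes.outputSizeInBytes,
                 &handle, nullptr, 0 );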

However, taking a look at the OptiX 7.7 SDK optixCutouts example (I’ve run into this elsewhere as well), I’ve noticed there is a bunch of GPU memory allocated by the instance (IAS) optixAccelBuild call: 400+ MB on Linux with an RTX A4000 (driver 535.129.03, CUDA 12.1) and 1.2+ GB on Linux with an RTX 4090 (driver 535.129.03, CUDA 12.1). In that example, this memory seems to be freed during the optixPipelineCreate call. I’ve seen this using cudaMemGetInfo() and nvidia-smi.
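
This is roughly how I’m measuring the drop (just a sketch; note the difference also includes the explicitly allocated temp/output build buffers, which I subtract out):

size_t freeBefore = 0, freeAfter = 0, total = 0;
cudaMemGetInfo( &freeBefore, &total );

optixAccelBuild( ... );            // the IAS build from buildInstanceAccel
cudaDeviceSynchronize();

cudaMemGetInfo( &freeAfter, &total );
printf( "free memory dropped by %zu MB\n", ( freeBefore - freeAfter ) / ( 1024 * 1024 ) );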

Additionally, if I want to rebuild the IAS after creating the pipeline (i.e. because transforms change significantly), this memory wouldn’t be freed, meaning a trivial scene could be using far more device memory than is explicitly allocated in the demo. This can be seen if you add an extra call to buildInstanceAccel after the rest of the scene is set up in optixCutouts.

What is this memory being allocated for, and is there a way to change this behavior or have the memory freed right away, if the freeing at the optixPipelineCreate call is associated with loading something into the driver? Thank you in advance!

Hi @amelmquist,

I wasn’t aware of this behavior until now, but I can reproduce it myself. After doing a little research, I found out that the issue here is that the CUDA runtime likes to hang on to the local (stack) memory allocation for a kernel. This apparent memory usage is not an explicit allocation on the part of optixAccelBuild, it’s just a worst-case stack size that essentially gets cached by the driver in order to make future similar kernel executions faster (meaning they don’t have to spend time allocating more stack memory if it’s already available).

This is mentioned in passing in the CUDA launch configuration documentation:

“The CUDA driver automatically increases the per-thread stack size for each kernel launch as needed. This size isn’t reset back to the original value after each launch. To set the per-thread stack size to a different value, cudaDeviceSetLimit() can be called to set this limit. The stack will be immediately resized, and if necessary, the device will block until all preceding requested tasks are complete. cudaDeviceGetLimit() can be called to get the current per-thread stack size.”
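
In code, the runtime-API calls the documentation refers to are just these (shown for completeness; the actual workaround below uses the equivalent driver-API calls):

size_t stackSize = 0;
cudaDeviceGetLimit( &stackSize, cudaLimitStackSize );  // query the current per-thread stack size
cudaDeviceSetLimit( cudaLimitStackSize, stackSize );   // set a new per-thread stack size; takes effect immediately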

There used to be a context flag to control this behavior, but it has been deprecated:

CU_CTX_LMEM_RESIZE_TO_MAX: Instruct CUDA to not reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage at the cost of potentially increased memory usage. Deprecated: This flag is deprecated and the behavior enabled by this flag is now the default and cannot be disabled. Instead, the per-thread stack size can be controlled with cuCtxSetLimit().

So if you need a workaround, we currently suggest using cuCtxGetLimit & cuCtxSetLimit to clear the stack allocation, with the slight wrinkle that you have to change the stack size in order to get it to stick (which is why there’s a +4 in the following example):

size_t stackSize; 
cuCtxGetLimit( &stackSize, CU_LIMIT_STACK_SIZE ); 
 
optixAccelBuild( ... ); // IAS build increases the stack size limit 
 
stackSize += 4; 
cuCtxSetLimit( CU_LIMIT_STACK_SIZE, stackSize ); // Reset stack size to (almost) what it was before

I’ve confirmed this works on the optixCutouts sample. BTW, be aware the need for this workaround may be temporary, and behavior may change in future drivers. We may or may not try to reset the allocation in OptiX so it’s less visible.

I don’t actually know yet why optixPipelineCreate() appears to reset the stack size. I will try to find out, and I’ve also requested some clarification on what happens if you call cudaMalloc with an amount that exceeds free memory after the stack size has been increased. People are all starting to leave for the holidays, so I might not get a response until January. I’m just speculating wildly here, but if the driver will volunteer to release the local memory allocation any time something else needs it, this might be an issue that causes memory accounting confusion but doesn’t actually count against your memory budget.


David.

Hi David,

Thank you for the quick and thorough response! This is very helpful and clarifies my understanding of the memory usage here.

I tried using cuCtxGetLimit and cuCtxSetLimit, and it does what you described, reducing the memory use after the build is complete. However, cuCtxSetLimit seems to be a very expensive operation given that it has to be called after every rebuild of the IAS, which for a dynamic scene could be every frame (it looks like it costs on the order of milliseconds per call). Additionally, setting the stack limit significantly increases the time to perform IAS rebuilds, by multiple orders of magnitude (presumably due to reallocation of the local memory?). So this temporary workaround, or having OptiX do it automatically, might not be a great option performance-wise (though it is helpful if memory becomes a bigger problem than time).
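
For what it’s worth, this is roughly how I’m timing the reset (just a sketch; the numbers obviously depend on the system):

#include <chrono>
#include <cstdio>

size_t stackSize = 0;
cuCtxGetLimit( &stackSize, CU_LIMIT_STACK_SIZE );

auto t0 = std::chrono::steady_clock::now();
cuCtxSetLimit( CU_LIMIT_STACK_SIZE, stackSize + 4 );   // synchronizes, then resizes the local memory allocation
auto t1 = std::chrono::steady_clock::now();

printf( "cuCtxSetLimit took %.3f ms\n",
        std::chrono::duration<double, std::milli>( t1 - t0 ).count() );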

Given this performance difference, I understand the trade-off; however, the memory use is substantial. I haven’t verified whether it counts against the application’s memory budget, but it does significantly limit the memory available on the device for other applications or for running parallel renders. Do you know why optixAccelBuild would prompt allocation of so much local memory for running the build with only a few objects? Without more insight, 0.5–1.2 GB seems like a lot for this example, and it stays about the same even if I reduce the example to just a single sphere.

Thank you for your help!

Do you know why optixAccelBuild would prompt allocation of so much local memory for running the build with only a few objects?

Unfortunately, the memory usage is not dependent on the number of objects. It’s a stack size configuration that depends on how many SMs your GPU has and how much stack space the build kernel needs: the driver basically has to assume that the stack space will be required for all cores on the GPU at once, and every byte per thread adds up quickly.
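
As a rough back-of-the-envelope sketch (purely illustrative; the per-thread stack figure below is a guess on my part, not the build kernel’s actual requirement), you can see how the worst case scales with the device:

cudaDeviceProp prop = {};
cudaGetDeviceProperties( &prop, 0 );

const size_t stackBytesPerThread = 6 * 1024;  // hypothetical per-thread stack need, for illustration only

// Worst case assumes every thread that can be resident on the GPU needs its own stack.
size_t residentThreads = (size_t)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
size_t worstCaseBytes  = residentThreads * stackBytesPerThread;

printf( "%d SMs x %d threads/SM x %zu B/thread ~= %zu MB\n",
        prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
        stackBytesPerThread, worstCaseBytes / ( 1024 * 1024 ) );

With per-thread numbers in that ballpark, a GPU with a high SM count like the 4090 lands in the GB range, which is roughly consistent with what you’re seeing.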

I agree the size seems like a lot and could be limiting. I’ll check with the team whether this is something we could mitigate by adding API hints that pass more information to the builder, allowing it to use less space. In your case, if you need to build the IAS every frame, then I guess there’s little point in resetting the stack size after the build; you will be subject to the build kernel’s local memory usage any time that kernel has to execute.

So we’ll see if we can reduce the usage. No guarantees; I don’t know yet what’s involved or what options we have. And be aware that even if we can improve the situation, with our current lead times and QA & release schedules it might take a while before you get any relief, sorry about that. Do let us know if it becomes a serious blocker for you.


David.


I’ll be on the lookout in case something changes or hints can be added. Thank you for your help and for the explanation of the build and local memory usage.