I wasn’t aware of this behavior until now, but I can reproduce it myself. After doing a little research, I found that the issue is the CUDA runtime hanging on to the local (stack) memory allocation for a kernel. This apparent memory usage is not an explicit allocation on the part of
optixAccelBuild; it’s just a worst-case stack size that essentially gets cached by the driver in order to make future similar kernel launches faster (meaning they don’t have to spend time allocating more stack memory if it’s already available).
This is mentioned in passing in the CUDA launch configuration documentation:
“The CUDA driver automatically increases the per-thread stack size for each kernel launch as needed. This size isn’t reset back to the original value after each launch. To set the per-thread stack size to a different value,
cudaDeviceSetLimit() can be called to set this limit. The stack will be immediately resized, and if necessary, the device will block until all preceding requested tasks are complete.
cudaDeviceGetLimit() can be called to get the current per-thread stack size.”
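In runtime-API terms, that query-and-set pattern looks like the following. This is just a minimal sketch with error checking omitted; the printed value is whatever your driver currently has cached:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;

    // Query the current per-thread stack size.
    cudaDeviceGetLimit( &stackSize, cudaLimitStackSize );
    printf( "Per-thread stack size: %zu bytes\n", stackSize );

    // Setting the limit resizes the stack allocation immediately;
    // the device blocks until preceding work has completed.
    cudaDeviceSetLimit( cudaLimitStackSize, stackSize );

    return 0;
}
```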
There used to be a context flag to control this behavior, but it has been deprecated:
CU_CTX_LMEM_RESIZE_TO_MAX: Instruct CUDA to not reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage at the cost of potentially increased memory usage. Deprecated: This flag is deprecated and the behavior enabled by this flag is now the default and cannot be disabled. Instead, the per-thread stack size can be controlled with cuCtxSetLimit().
So if you need a workaround, we currently suggest using
cuCtxSetLimit to clear the stack allocation, with the slight wrinkle that you have to change the stack size in order to get it to stick (which is why there’s a +4 in the following example):
size_t stackSize = 0;
cuCtxGetLimit( &stackSize, CU_LIMIT_STACK_SIZE ); // Save the limit before the build
optixAccelBuild( ... ); // IAS build increases the stack size limit
stackSize += 4;
cuCtxSetLimit( CU_LIMIT_STACK_SIZE, stackSize ); // Reset stack size to (almost) what it was before
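If you end up applying this in several places, one way to package it is a small RAII guard. This is just a sketch of my own (the name StackSizeGuard is hypothetical, not an OptiX or CUDA type), assuming a valid current CUDA context:

```cpp
#include <cuda.h>

// Hypothetical helper: records the stack limit before a build and
// restores it when the guard goes out of scope. The +4 is needed
// because setting the identical value doesn't release the cached
// allocation; the size has to actually change for it to stick.
struct StackSizeGuard
{
    size_t savedStackSize = 0;

    StackSizeGuard()
    {
        cuCtxGetLimit( &savedStackSize, CU_LIMIT_STACK_SIZE );
    }

    ~StackSizeGuard()
    {
        cuCtxSetLimit( CU_LIMIT_STACK_SIZE, savedStackSize + 4 );
    }
};
```

Usage would be wrapping the build in a scope, e.g. { StackSizeGuard guard; optixAccelBuild( ... ); }, so the limit is restored automatically on exit.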
I’ve confirmed this works on the
optixCutouts sample. BTW, be aware that the need for this workaround may be temporary, and the behavior may change in future drivers. We may or may not try to reset the allocation in OptiX so it’s less visible. I don’t actually know yet why
optixPipelineCreate() appears to reset the stack size. I will try to find out, and I’ve also requested some clarification on what happens if you call
cudaMalloc with an amount that exceeds free memory after the stack size has been increased. People are all starting to leave for the holidays, so I might not get a response until January. I’m just speculating wildly here, but if the driver will volunteer to release the local memory allocation any time something else needs it, this might be an issue that causes memory accounting confusion but doesn’t actually count against your memory budget.
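In the meantime, one way to see the effect for yourself is to compare the reported free memory before and after a stack-heavy launch. This is only a diagnostic sketch; where exactly the build or launch goes is up to you:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Query the currently reported free device memory in bytes.
static size_t freeDeviceMemory()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo( &freeBytes, &totalBytes );
    return freeBytes;
}

int main()
{
    size_t before = freeDeviceMemory();

    // ... run optixAccelBuild() or launch a kernel with high local
    // memory usage here ...

    size_t after = freeDeviceMemory();
    printf( "Apparent usage from cached stack: %zu bytes\n",
            before - after );

    return 0;
}
```

If the difference persists after the work completes, that is the cached local memory allocation this thread is about, not a leak on your side.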