Nsight Compute reports error calling optixAccelComputeMemoryUsage

Without Nsight the app reports no errors, even though every CUDA and OptiX API call is wrapped in error-checking code. I tried reducing the AS to a single object (made up of triangles) and removed vertex indexing as well, but had no luck.

Profiling the optixTriangle sample from the OptiX SDK works, whereas optixMeshViewer does not: a few seconds after resuming profiling, Nsight disconnects and switches to an empty screen with the Projects dialog open. My app shows the same disconnect on roughly half of its launches.

The setup is

  • Nsight Compute 2021.2 on Windows 10
  • CUDA 11.4 and OptiX 7.3 on Amazon Linux

End of Nsight API trace of app:

...
:,494,optixQueryFunctionTable,,OPTIX_SUCCESS(0),"(47, 0, 0x0, 0x0, 0x85b8c0, 304)",,,,
:,495,optixDeviceContextCreate,,OPTIX_SUCCESS(0),"(0x0, {0x000000000040AA20,0x0000000000000000,4,OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_OFF}, 0x7ffedb008d38{0x24c2580})",,,,
,496,optixAccelComputeMemoryUsage,,OPTIX_ERROR_INVALID_VALUE(7001),"(0x24c2580, {OPTIX_BUILD_FLAG_NONE,OPTIX_BUILD_OPERATION_BUILD,{0,0,0,0}}, {OPTIX_BUILD_INPUT_TYPE_TRIANGLES,{0x0000000004503900,8194,OPTIX_VERTEX_FORMAT_FLOAT3,0,0x7f9877418200,16384,OPTIX_INDICES_FORMAT_UNSIGNED_INT3,0,0x0,0x00007FFEDB009010,1,0x0,0,0,0}}, 1, {4237856,0,4})",,,,

Excerpt of Nsight API trace of optixTriangle:

...
:,113,optixQueryFunctionTable,,OPTIX_SUCCESS(0),"(47, 0, 0x0, 0x0, 0x6946a0, 304)",,,,
:,114,optixDeviceContextCreate,,OPTIX_SUCCESS(0),"(0x0, {0x0000000000409230,0x0000000000000000,4,OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_OFF}, 0x7ffd4dc10f18{0x223c9d0})",,,,
:,115,cudaMalloc,,cudaSuccess(0),"(0x7ffd4dc10f90{0x7f0ec7c00000}, 36)",,,,
:,116,cuMemAlloc_v2,,CUDA_SUCCESS(0),"(0x7ffd4dc10f90{0x7f0ec7c00000}, 36)",,,,
:,117,cudaMemcpy,,cudaSuccess(0),"(0x7f0ec7c00000, 0x7ffd4dc110a0, 36, cudaMemcpyHostToDevice(1))",,,,
:,118,cuMemcpyHtoD_v2,,CUDA_SUCCESS(0),"(0x7f0ec7c00000, 0x7ffd4dc110a0, 36)",,,,
:,119,optixAccelComputeMemoryUsage,,OPTIX_SUCCESS(0),"(0x223c9d0, {OPTIX_BUILD_FLAG_NONE,OPTIX_BUILD_OPERATION_BUILD,{0,0,0,0}}, {OPTIX_BUILD_INPUT_TYPE_TRIANGLES,{0x00007FFD4DC10F90,3,OPTIX_VERTEX_FORMAT_FLOAT3,0,0x0,0,OPTIX_INDICES_FORMAT_NONE,0,0x0,0x00007FFD4DC10F68,1,0x0,0,0,0}}, 1, {2432,640,0})",,,,
...

Kind regards, Jürgen.

Hi @otabuzzman,

The trace seems to be saying that one of the arguments to optixAccelComputeMemoryUsage() is bad. I’d recommend checking them carefully and also turning on OptiX validation mode (via OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL) - it may be able to reveal something in the build inputs or the setup that is causing the error. This function isn’t launching a kernel, so it’s unlikely to be causing an Nsight disconnect, but maybe Nsight is disconnecting intentionally after the API error.

Generally I would recommend building, profiling and debugging with CUDA 11.1 - which is the CUDA toolkit version mentioned in the OptiX 7.3 Release Notes. I have heard of a few issues people have with later toolkits. It might also be useful to roll back the Nsight Compute version to 2020.2.1 - the version that shipped with CUDA 11.1.


David.

Hi David,

I rolled back to Nsight Compute 2020.2.1 and CUDA 11.1.1 (but kept the current driver). I also enabled OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL, which gave me a hint:

OptiX API message :  2 :        ERROR : Validation mode caught builtin exception OPTIX_EXCEPTION_CODE_STACK_OVERFLOW
Error recording resource event on user stream (CUDA error string: unspecified launch failure, CUDA error code: 719)
exception: OPTX error: optixLaunch( pipeline, cuda_stream, d_lp_general, lp_general_size, &sbt, w , h , 1 )

It seems the problem does not actually occur in optixAccelComputeMemoryUsage but in optixLaunch, and is caused by improper stack usage, probably in one of the shaders. I’ll investigate and see if I can fix it.

Thx very much for your help.
Jürgen.

I would always recommend calculating the OptixPipeline stack size yourself.
It’s mandatory anyway when using direct or continuation callables because the built-in method doesn’t cover that.

Look for optixPipelineSetStackSize inside the OptiX SDK code examples and the Programming Guide.
Note that this calculation requires the maximum optixTrace recursion depth to be specified upfront.
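
For illustration, here is a minimal sketch of that calculation. It is modeled on the optixUtilComputeStackSizes helper in the SDK's optix_stack_size.h as far as I recall it, so please check it against your SDK version before relying on it. StackSizes, PipelineStack and computeStack are hypothetical stand-ins chosen so the snippet compiles without the OptiX headers; the field names follow OptixStackSizes.

```cpp
#include <algorithm>

// Hypothetical mirror of OptixStackSizes: the largest stack requirement
// (in bytes) found per program type across all program groups.
struct StackSizes {
    unsigned int cssRG, cssMS, cssCH, cssAH, cssIS, cssCC, dssDC;
};

// Hypothetical result type holding the three computed stack sizes.
struct PipelineStack {
    unsigned int directFromTraversal;
    unsigned int directFromState;
    unsigned int continuation;
};

// Assumed to follow the SDK helper's formula; verify against optix_stack_size.h.
PipelineStack computeStack(const StackSizes& s,
                           unsigned int maxTraceDepth,  // max optixTrace recursion depth
                           unsigned int maxCCDepth,     // max continuation callable depth
                           unsigned int maxDCDepth)     // max direct callable depth
{
    // Upper bound for a call tree of continuation callables.
    const unsigned int ccTree = maxCCDepth * s.cssCC;
    // Upper bound for a closest-hit or miss program including its callables.
    const unsigned int chOrMs = std::max(s.cssCH, s.cssMS) + ccTree;

    PipelineStack out;
    out.directFromTraversal = maxDCDepth * s.dssDC;
    out.directFromState     = maxDCDepth * s.dssDC;
    // Raygen, plus one CH/MS frame per nested trace level; the innermost
    // level may instead be dominated by intersection plus any-hit.
    out.continuation = s.cssRG + ccTree
        + (std::max(maxTraceDepth, 1u) - 1) * chOrMs
        + std::min(maxTraceDepth, 1u) * std::max(chOrMs, s.cssIS + s.cssAH);
    return out;
}
```

In a real application the per-program sizes would come from optixProgramGroupGetStackSize for every program group in the pipeline, and the three results would be passed to optixPipelineSetStackSize together with the maximum traversable graph depth.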

Hi Detlef,

Calculating and setting the various stack sizes according to the Programming Guide solved the problem and exposed a new one, OPTIX_EXCEPTION_CODE_TRACE_DEPTH_EXCEEDED, but that was simply an off-by-one in the trace depth. As far as OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL can tell, there are no more errors in my app.

Unfortunately, CUDA 11.1.1 and Nsight Compute 2020.2.1 do not work either: Nsight disconnects immediately after launch. I will investigate this further.

Thx for your help so far.
Jürgen.

To bring this thread to an end: one of the arguments to optixAccelComputeMemoryUsage was wrong. No more complaints from Nsight Compute after fixing.

The argument in question was OptixBuildInput. From its union I use the OptixBuildInputTriangleArray member, a struct whose flags member expects a pointer to an array of unsigned integers. I had pointed it at a stack variable that I mistakenly defined inside a for-loop body. That variable was of course out of scope by the time optixAccelComputeMemoryUsage executed after the loop, so the pointer was dangling.
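
A minimal sketch of the pattern, in case it helps others. It does not use the OptiX headers: TriangleInput, sumFlags and makeInputs are hypothetical stand-ins, and only the flags and numSbtRecords member names are borrowed from OptixBuildInputTriangleArray. The point is that the flag storage has to outlive the call that finally reads the build inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for OptixBuildInputTriangleArray, reduced to the
// two members that matter here. It stores a pointer, not a copy.
struct TriangleInput {
    const unsigned int* flags = nullptr;
    unsigned int numSbtRecords = 0;
};

// Hypothetical stand-in for the API call that reads the inputs later on.
unsigned int sumFlags(const std::vector<TriangleInput>& inputs) {
    unsigned int sum = 0;
    for (const TriangleInput& in : inputs) {
        assert(in.flags != nullptr);
        for (unsigned int i = 0; i < in.numSbtRecords; ++i)
            sum += in.flags[i];
    }
    return sum;
}

// Builds one input per entry of flagStorage. The caller owns flagStorage
// and must keep it alive (and unmodified in size) until sumFlags has run.
std::vector<TriangleInput> makeInputs(const std::vector<unsigned int>& flagStorage) {
    std::vector<TriangleInput> inputs(flagStorage.size());
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        // The mistake was the equivalent of:
        //   unsigned int flags[1] = { ... };  // local to the loop body
        //   inputs[i].flags = flags;          // dangles after this iteration
        // Pointing into storage that outlives the loop avoids that.
        inputs[i].flags = &flagStorage[i];
        inputs[i].numSbtRecords = 1;
    }
    return inputs;
}
```

Declaring the flag storage in the same scope as the later call (or one enclosing it) is the simplest fix; any container works as long as it is not resized or destroyed in between.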

Kind regards
Jürgen

Hey, thanks for mentioning the reason! It helps us to hear what snags can happen so that we can think about improving our error messages and validation coverage. In this case I'm not sure, but I would guess that an out-of-scope stack pointer is something we have no way to detect. I'll ask the team anyway, just in case.


David.