Split Triangle Mesh into Multiple GAS

I have read the Programming Guide and scoured the forums and the internet for a sample code or clear tutorial on how to ray trace two or more GAS simultaneously. In my case I’m splitting a very large model (which exceeds the primitive-count limit for a single GAS) into smaller pieces. I think I need to create one SBT for each GAS, despite the raygen/miss/hit algorithms being the same for all pieces. But is there a clear example of this somewhere?

I’m going to try now to modify the optixTriangle example to render two triangles, each in its own GAS, but any suggestions would be appreciated.

First, splitting a GAS into multiple GAS should be straightforward.
You use the same routine calling optixAccelBuild, just with fewer triangles per build.

Make sure you partition the triangles into reasonably disjoint spatial blocks so that the resulting AABBs of the individual GAS overlap as little as possible. A simple 3D grid would do.
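A minimal sketch of such a partition (a hypothetical helper, nothing from the SDK): bin each triangle by its centroid into a uniform grid, then build one GAS per non-empty cell. It assumes the centroids lie inside the given model bounds.

```cpp
#include <cstdint>

struct Vec3 { float x, y, z; };

// Map a triangle to a grid cell by its centroid. 'mins'/'maxs' bound the
// whole model, 'res' is the number of cells per axis.
uint32_t cellOfTriangle( const Vec3& a, const Vec3& b, const Vec3& c,
                         const Vec3& mins, const Vec3& maxs, uint32_t res )
{
    const Vec3 centroid = { ( a.x + b.x + c.x ) / 3.0f,
                            ( a.y + b.y + c.y ) / 3.0f,
                            ( a.z + b.z + c.z ) / 3.0f };
    auto axisCell = [&]( float p, float lo, float hi ) -> uint32_t
    {
        const float t = ( p - lo ) / ( hi - lo );            // normalize to [0,1]
        const uint32_t i = static_cast<uint32_t>( t * res ); // cell along one axis
        return ( i < res ) ? i : res - 1;                    // clamp the upper border
    };
    const uint32_t ix = axisCell( centroid.x, mins.x, maxs.x );
    const uint32_t iy = axisCell( centroid.y, mins.y, maxs.y );
    const uint32_t iz = axisCell( centroid.z, mins.z, maxs.z );
    return ix + res * ( iy + res * iz ); // linear cell index
}
```

Each cell's triangle list then becomes the single build input of its own GAS.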

Building multiple smaller GAS will also require less temporary memory.

If you really have a scene where a single GAS would exceed the maximum primitive count of 2^29 (== 536,870,912), that is a pretty huge model.
The acceleration structures are not small. I think I got about 625 MTris onto a 48 GB VRAM board in the past.
What is your system configuration?

Second, the SBT is rather flexible. Depending on what you want to do (e.g. what kind of materials you assign to that geometry), the layout can range from a single SBT hit record (all primitives use the same material shader, potentially with different parameters), to one SBT hit record per material shader (different primitives use different material shaders, again potentially with different parameters), to one or more SBT hit records per instance.

I’m going to try now to modify the optixTriangle example to render two triangles, each in its own GAS, but any suggestions would be appreciated.

All my OptiX Advanced Examples are using a top-level instance acceleration structure (IAS) over multiple GAS.
They are linked inside the sticky posts of this sub-forum and many of the following threads link to source code explaining the different SBT layouts used.
Definitely read this one: https://forums.developer.nvidia.com/t/passing-per-vertex-attribute-data-into-a-shader-program/279321/2

The topic of splitting big GAS into multiple smaller ones has been discussed on this forum quite often.
https://forums.developer.nvidia.com/t/should-i-break-a-large-gas-into-multiple-smaller-ones/211603
https://forums.developer.nvidia.com/t/2-gases-insert-into-1-ias-problem/244155
https://forums.developer.nvidia.com/t/question-about-instance-acceleartion-struction/283898
https://forums.developer.nvidia.com/t/memory-consumption-relation-on-gas/256454
https://forums.developer.nvidia.com/t/decomposing-bvh-to-accelerate-traversal/283911/2

There have been many threads about different SBT layouts on the forum.
Start here and read all threads linked inside these posts as well:
https://forums.developer.nvidia.com/t/question-about-instance-acceleartion-struction/283898/4

https://forums.developer.nvidia.com/t/explaining-hitgroup-records-creation-in-sutil-vs-path-tracer/201777/2
https://forums.developer.nvidia.com/t/basic-question-world-and-object-coordinates/169725/5

@droettger Thank you for your very patient and informative response. It helped me figure out what my question actually is.

For (a smaller version of) my very large model I did create two GAS, each with half the faces of the original mesh. One GAS has buildInput.triangleArray.primitiveIndexOffset=0. The other has primitiveIndexOffset = nfaces/2.

	OptixBuildInput inputs[2];
	inputs[0] = {};
	inputs[1] = {};

	const uint32_t triangle_input_flags[1] = { OPTIX_GEOMETRY_FLAG_NONE };

	inputs[0].type						=	OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
	inputs[0].triangleArray.vertexFormat			=	OPTIX_VERTEX_FORMAT_FLOAT3;
	inputs[0].triangleArray.numVertices			=	(uint32_t)nvertices;
	inputs[0].triangleArray.vertexBuffers			=	&d_vertices;
	inputs[0].triangleArray.flags				=	triangle_input_flags;
	inputs[0].triangleArray.numSbtRecords			=	(uint32_t)1;
	inputs[0].triangleArray.sbtIndexOffsetBuffer		=	0;		// CUdeviceptr; no per-primitive SBT index buffer
	inputs[0].triangleArray.sbtIndexOffsetSizeInBytes	=	(uint32_t)0;
	inputs[0].triangleArray.sbtIndexOffsetStrideInBytes	=	(uint32_t)0;
	inputs[0].triangleArray.indexFormat			=	OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
	inputs[0].triangleArray.numIndexTriplets		=	(uint32_t)nindices/2;				// we can safely divide by 2; number of triangles will always be even since we are building from quads
	inputs[0].triangleArray.primitiveIndexOffset		=	(uint32_t)0;					// offset is zero
	inputs[0].triangleArray.indexBuffer			=	d_indices;

	inputs[1].type						=	OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
	inputs[1].triangleArray.vertexFormat			=	OPTIX_VERTEX_FORMAT_FLOAT3;
	inputs[1].triangleArray.numVertices			=	nvertices;
	inputs[1].triangleArray.vertexBuffers			=	&d_vertices;
	inputs[1].triangleArray.flags				=	triangle_input_flags;
	inputs[1].triangleArray.numSbtRecords			=	(uint32_t)1;
	inputs[1].triangleArray.sbtIndexOffsetBuffer		=	0;		// CUdeviceptr; no per-primitive SBT index buffer
	inputs[1].triangleArray.sbtIndexOffsetSizeInBytes	=	(uint32_t)0;
	inputs[1].triangleArray.sbtIndexOffsetStrideInBytes	=	(uint32_t)0;
	inputs[1].triangleArray.indexFormat			=	OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
	inputs[1].triangleArray.numIndexTriplets		=	(uint32_t)nindices/2;				// number of triangles will always be even since we are building from quads
	inputs[1].triangleArray.primitiveIndexOffset		=	(uint32_t)nindices/2;				// offset is one half of all triangles
	inputs[1].triangleArray.indexBuffer			=	d_indices;

	OptixAccelBufferSizes gas_buffer_sizes;
	OPTIX_CHECK(
		optixAccelComputeMemoryUsage(
			context,
			&accel_options,
			inputs,
			2, // Number of build inputs
			&gas_buffer_sizes
		)
	);

	CUdeviceptr d_temp_buffer_gas;
	CUDA_CHECK(
		cudaMalloc(
			reinterpret_cast<void**>( &d_temp_buffer_gas ),
			gas_buffer_sizes.tempSizeInBytes
		)
	);

	CUDA_CHECK(
		cudaMalloc(
			reinterpret_cast<void**>( &d_gas_output_buffer ),
			gas_buffer_sizes.outputSizeInBytes
		)
	);

	OPTIX_CHECK(
		optixAccelBuild(
			context,
			0,  // CUDA stream
			&accel_options,
			inputs,
			2,  // num build inputs
			d_temp_buffer_gas,
			gas_buffer_sizes.tempSizeInBytes,
			d_gas_output_buffer,
			gas_buffer_sizes.outputSizeInBytes,
			&gas_handle,
			nullptr,// emitted property list
			0   // num emitted properties
		)
	);

	CUDA_CHECK(
		cudaFree(
			reinterpret_cast<void*>( d_temp_buffer_gas )
		)
	);

With no other changes to the host or device code, one half of the model is now not occluded by the other half, which makes me think that one half of the mesh is somehow not registered correctly with OptiX.

I did not make any changes to the SBT. Should I have added another raygen/hit/miss program? The Programming Guide says that the build inputs of a GAS are assigned consecutive SBT record indices. If I am using only one record, held over from my single-mesh code, will both build inputs just “see” that and use it? My suspicion is that the answer is no, but I am still climbing the learning curve and unsure how to proceed.

Nope, that is not at all how splitting a big GAS into multiple GAS works. What you did now is building a single GAS with two build-inputs. You only called optixAccelBuild once.

That will neither reduce the size of the GAS, nor will it work inside the optixTriangle example, because that has only one SBT hit record, but with your two build inputs you need two now because of this:

inputs[0].triangleArray.numSbtRecords			=	(uint32_t)1;
inputs[1].triangleArray.numSbtRecords			=	(uint32_t)1;

That’s all fine and dandy for two build inputs but that isn’t at all what you want or what the optixTriangle example supports.
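As a rule of thumb (my paraphrase of the Programming Guide, so verify against it): the number of hit records a GAS consumes in the common layout is the sum of numSbtRecords over its build inputs, multiplied by the number of ray types (the sbtStride passed to optixTrace). A small checkable version:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: number of SBT hit records one GAS consumes, given the
// numSbtRecords value of each of its build inputs and the number of ray types.
uint32_t hitRecordsForGas( const std::vector<uint32_t>& numSbtRecordsPerInput,
                           uint32_t numRayTypes )
{
    const uint32_t perGas = std::accumulate( numSbtRecordsPerInput.begin(),
                                             numSbtRecordsPerInput.end(), 0u );
    return perGas * numRayTypes;
}
```

With the two build inputs above (numSbtRecords = 1 each) and a single ray type, the GAS already consumes two hit records, while optixTriangle provides only one.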

Please take a step back and read all linked threads I posted first! Really, I mean it. I explained what you need inside them.

Let’s look at the overall render graph you need first.
There needs to be a top-level IAS and that needs to have multiple GAS referenced in its OptixInstance structures. Looks like this:

       IAS
     /  |  \
GAS_0  ...  GAS_N

You need one optixAccelBuild for each of these GAS_0 to GAS_N with only one build input describing the triangles in that one GAS.

Then there is one optixAccelBuild for the single IAS which in this case has N+1 OptixInstances which each reference one of the GAS.
If the triangles are all defined in world space, the matrices on the OptixInstance are all the identity.

The OptixTraversableHandle of the IAS is your top-level traversable argument used inside the optixTrace call inside the device code, so that handle needs to be put into your launch parameter structure to have it accessible.

The optixTriangle example doesn’t support that because it’s using the traversableGraphFlags OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS
but the above render graph requires OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING.

Again, there are different ways to access the triangle data inside the device code, and depending on what you choose, different numbers of SBT hit records are needed.
Please read this first:
https://forums.developer.nvidia.com/t/passing-per-vertex-attribute-data-into-a-shader-program/279321/2
I would strongly recommend you use the second method for this case!

For the simple case of building GAS and putting them under an IAS, have a look at my simpler introductory example code.
This creates different GAS at runtime and places them under an IAS:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_runtime/src/Application.cpp#L1361
The individual create<Shape> functions in that file generate the (interleaved) vertex attributes and build a single GAS each.
The IAS traversable handle m_root is finally put into the launch parameters.

Note that the intro examples are using one SBT hit record per OptixInstance inside the IAS. That is not necessary when you architect the SBT more like the rtigo12 example.

Thanks for the clarifications. And also thanks for reiterating that I should spend more time looking at your example code. The temptation is strong to ask for help first and look at code later.

I was able to rework the optixTriangle example to use two triangles, each in its own respective GAS, and then instancing each GAS with identity transform to an IAS. This is the only change I made, and I can now see both triangles on the screen!

	//
	// SPLIT TRIANGLES INTO TWO GAS, UNDER A SINGLE IAS
	//

	// Note that we also need to set:
	//   pipeline_compile_options.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING;
	// and the maximum traversal depth to 2:
	//   OPTIX_CHECK( optixPipelineSetStackSize( pipeline, direct_callable_stack_size_from_traversal,
	//                                           direct_callable_stack_size_from_state, continuation_stack_size,
	//                                           2  // maxTraversableDepth
	//                                           ) );

	OptixAccelBuildOptions accel_options = {};
	accel_options.buildFlags = OPTIX_BUILD_FLAG_NONE;
	accel_options.operation  = OPTIX_BUILD_OPERATION_BUILD;

	const std::array<float3, 4> vertices =
	{ {
		{ -0.5f, -0.5f, 0.0f },
		{  0.5f, -0.5f, 0.0f },
		{  0.0f,  0.5f, 0.0f },
		{  1.0f,  0.5f, 0.0f }
	} };

	const std::array<uint3, 1> indices1 =
	{ {
		{ 0, 1, 2 }
	} };
	const std::array<uint3, 1> indices2 =
	{ {
		{ 1, 3, 2 }
	} };

	CUdeviceptr d_vertices;
	CUdeviceptr d_indices1;
	CUdeviceptr d_indices2;

	const size_t vertices_size = sizeof(float3) * vertices.size();
	CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_vertices ), vertices_size ) );
	CUDA_CHECK( cudaMemcpy( reinterpret_cast<void*>( d_vertices ), vertices.data(), vertices_size, cudaMemcpyHostToDevice ) );

	const size_t indices_size1 = sizeof(uint3) * indices1.size();
	CUDA_CHECK(cudaMalloc( reinterpret_cast<void**>( &d_indices1 ), indices_size1 ) );
	CUDA_CHECK(cudaMemcpy( reinterpret_cast<void*>( d_indices1 ), indices1.data(), indices_size1, cudaMemcpyHostToDevice ) );

	const size_t indices_size2 = sizeof(uint3) * indices2.size();
	CUDA_CHECK(cudaMalloc( reinterpret_cast<void**>( &d_indices2 ), indices_size2 ) );
	CUDA_CHECK(cudaMemcpy( reinterpret_cast<void*>( d_indices2 ), indices2.data(), indices_size2, cudaMemcpyHostToDevice ) );

	const uint32_t triangle_input_flags[1] = { OPTIX_GEOMETRY_FLAG_NONE };

	OptixTraversableHandle gas_handle1;
	CUdeviceptr d_gas_output1;	// device memory holding GAS 1

	OptixTraversableHandle gas_handle2;
	CUdeviceptr d_gas_output2;	// device memory holding GAS 2

	OptixBuildInput triangle_input1 = {};
	triangle_input1.type				= OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
	triangle_input1.triangleArray.vertexFormat	= OPTIX_VERTEX_FORMAT_FLOAT3;
	triangle_input1.triangleArray.numVertices	= static_cast<uint32_t>( vertices.size() );
	triangle_input1.triangleArray.vertexBuffers	= &d_vertices;
	triangle_input1.triangleArray.flags		= triangle_input_flags;
	triangle_input1.triangleArray.numSbtRecords	= 1;
	triangle_input1.triangleArray.indexFormat	= OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
	triangle_input1.triangleArray.numIndexTriplets	= 1;
//	triangle_input1.triangleArray.primitiveIndexOffset	= 0;
	triangle_input1.triangleArray.indexBuffer	= d_indices1;

	OptixBuildInput triangle_input2 = {};
	triangle_input2.type				= OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
	triangle_input2.triangleArray.vertexFormat	= OPTIX_VERTEX_FORMAT_FLOAT3;
	triangle_input2.triangleArray.numVertices	= static_cast<uint32_t>( vertices.size() );
	triangle_input2.triangleArray.vertexBuffers	= &d_vertices;
	triangle_input2.triangleArray.flags		= triangle_input_flags;
	triangle_input2.triangleArray.numSbtRecords	= 1;
	triangle_input2.triangleArray.indexFormat	= OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
	triangle_input2.triangleArray.numIndexTriplets	= 1;
//	triangle_input2.triangleArray.primitiveIndexOffset	= 0;
	triangle_input2.triangleArray.indexBuffer	= d_indices2;

	OptixAccelBufferSizes gas_buffer_sizes;
	OPTIX_CHECK( optixAccelComputeMemoryUsage(
		context,
		&accel_options,
		&triangle_input1,
		1, // Number of build inputs
		&gas_buffer_sizes
	) );

	CUdeviceptr d_temp_buffer_gas;
	CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_temp_buffer_gas ), gas_buffer_sizes.tempSizeInBytes ) );
	CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_gas_output1 ), gas_buffer_sizes.outputSizeInBytes ) );

	OPTIX_CHECK( optixAccelBuild(
		context, 0,  // CUDA stream
		&accel_options,
		&triangle_input1,
		1,  // num build inputs
		d_temp_buffer_gas, gas_buffer_sizes.tempSizeInBytes,
		d_gas_output1, gas_buffer_sizes.outputSizeInBytes,
		&gas_handle1,
		nullptr, 0
	) );

	CUDA_CHECK( cudaFree( reinterpret_cast<void*>( d_temp_buffer_gas ) ) );

	OPTIX_CHECK( optixAccelComputeMemoryUsage(
		context,
		&accel_options,
		&triangle_input2,
		1, // Number of build inputs
		&gas_buffer_sizes
	) );

	CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_temp_buffer_gas ), gas_buffer_sizes.tempSizeInBytes ) );
	CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_gas_output2 ), gas_buffer_sizes.outputSizeInBytes ) );

	OPTIX_CHECK( optixAccelBuild(
		context, 0,  // CUDA stream
		&accel_options,
		&triangle_input2,
		1,  // num build inputs
		d_temp_buffer_gas, gas_buffer_sizes.tempSizeInBytes,
		d_gas_output2, gas_buffer_sizes.outputSizeInBytes,
		&gas_handle2,
		nullptr, 0
	) );

	CUDA_CHECK( cudaFree( reinterpret_cast<void*>( d_temp_buffer_gas ) ) );

	const float transform[12] =	// identity since all points are in world coordinates
	{
		1.0f, 0.0f, 0.0f, 0.0f,
		0.0f, 1.0f, 0.0f, 0.0f,
		0.0f, 0.0f, 1.0f, 0.0f
	};

	OptixInstance instances[2];

	instances[0] = {};
	memcpy(instances[0].transform, transform, sizeof(float) * 12);
	instances[0].instanceId		= 0;
	instances[0].visibilityMask	= OptixVisibilityMask(255);
	instances[0].sbtOffset		= 0;
	instances[0].flags		= OPTIX_INSTANCE_FLAG_NONE;
	instances[0].traversableHandle	= gas_handle1;

	instances[1] = {};
	memcpy(instances[1].transform, transform, sizeof(float) * 12);
	instances[1].instanceId		= 1;
	instances[1].visibilityMask	= OptixVisibilityMask(255);
	instances[1].sbtOffset		= 0;
	instances[1].flags		= OPTIX_INSTANCE_FLAG_NONE;
	instances[1].traversableHandle	= gas_handle2;

	CUdeviceptr d_instances;
	size_t instances_size = sizeof(OptixInstance) * 2;
	CUDA_CHECK( cudaMalloc(	(void**)&d_instances, instances_size ) );
	CUDA_CHECK( cudaMemcpy(	(void*)d_instances, instances, instances_size, cudaMemcpyHostToDevice ) );

	OptixBuildInput instance_input = {};
	instance_input.type = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
	instance_input.instanceArray.instances = d_instances;
	instance_input.instanceArray.numInstances = 2;

	OptixAccelBufferSizes ias_buffer_sizes = {};
	OPTIX_CHECK(
		optixAccelComputeMemoryUsage(
			context,
			&accel_options,
			&instance_input,
			1,
			&ias_buffer_sizes
		)
	);

	CUdeviceptr d_ias;
	CUDA_CHECK( cudaMalloc( (void**)&d_ias, ias_buffer_sizes.outputSizeInBytes ) );

	CUdeviceptr d_tmp;
	CUDA_CHECK( cudaMalloc( (void**)&d_tmp, ias_buffer_sizes.tempSizeInBytes ) );

	OptixTraversableHandle ias_handle;
	OPTIX_CHECK(
		optixAccelBuild(
			context, 0,
			&accel_options,
			&instance_input,
			1,
			d_tmp, ias_buffer_sizes.tempSizeInBytes,
			d_ias,	ias_buffer_sizes.outputSizeInBytes,
			&ias_handle,
			nullptr, 0
		)
	);

	CUDA_CHECK( cudaStreamSynchronize( 0 ) );

	CUDA_CHECK( cudaFree( (void*)d_tmp ) );

	CUDA_CHECK( cudaFree( (void*)d_instances ) );

I am puzzled that I did not have to make any changes to the SBT. I expected that I would need to add a hit program entry for the second GAS.

Also, it was necessary to split the mesh indices into two device arrays, one for each triangle build input. I tried using a single device array with all (2 of) the index triplets and setting primitiveIndexOffset, but that didn’t seem to work.
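If I understand the Programming Guide correctly, primitiveIndexOffset only biases the value returned by optixGetPrimitiveIndex(); it does not change which index triplets the build reads. So for the single-array attempt I suspect I needed to advance the indexBuffer pointer itself for the second build input. CUdeviceptr is plain byte-address arithmetic, so a hypothetical helper would look like this (typedefs stand in for the CUDA headers):

```cpp
#include <cstddef>

// CUdeviceptr is an integer byte address on the device; typedef'd here so the
// pointer arithmetic can be checked without the CUDA headers.
typedef unsigned long long CUdeviceptr;

struct uint3 { unsigned int x, y, z; };   // same layout as CUDA's vector type

// Device address of index triplet 'first' within a buffer of uint3 triplets
// starting at byte address 'base'.
CUdeviceptr indexBufferAt( CUdeviceptr base, size_t first )
{
    return base + first * sizeof( uint3 );
}
```

With that, the second build input would use indexBuffer = indexBufferAt( d_indices, nindices / 2 ) together with numIndexTriplets = nindices / 2 and primitiveIndexOffset = nindices / 2.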

Please read this post which explains how the effective SBT index is calculated and why your setup worked:
https://forums.developer.nvidia.com/t/using-sbt-offsets-in-gas-in-program-with-only-one-sbt-record/285502/2

That post links to the corresponding chapter inside the OptiX Programming Guide.
It’s crucial to understand how the SBT indexing works to be able to develop any SBT layout matching your intended use case.

Spoiler: it’s because you set all OptixInstance sbtOffset fields to 0 inside your setup.
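To spell that calculation out (my reading of the formula in the linked Programming Guide chapter, so verify there):

```cpp
#include <cstdint>

// Effective SBT hit record index as computed during traversal:
//   instanceSbtOffset : sbtOffset of the OptixInstance that was hit
//   gasSbtIndex       : SBT index of the primitive inside its GAS
//                       (assigned consecutively over the build inputs)
//   sbtStride, sbtOffset : the values passed to optixTrace
uint32_t effectiveSbtIndex( uint32_t instanceSbtOffset, uint32_t gasSbtIndex,
                            uint32_t sbtStride, uint32_t sbtOffset )
{
    return instanceSbtOffset + gasSbtIndex * sbtStride + sbtOffset;
}
```

Both your instances use sbtOffset = 0, each GAS has a single build input with numSbtRecords = 1 (so gasSbtIndex is always 0), and optixTriangle traces with SBT offset 0, so every hit resolves to hit record 0, the one record that was already there.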

I posted this above already, but this post and the links inside it explain IAS->GAS render graphs and SBT layouts:
https://forums.developer.nvidia.com/t/question-about-instance-acceleartion-struction/283898/4