DirectX -> OptiX: single geometry buffer or multiple?


I have a question about buffers. I have this vertex struct in a DirectX 11 vertex buffer:
XMFLOAT3 position;
XMFLOAT3 normal;
XMFLOAT4 tangent;
XMFLOAT2 texcoord;

The struct has perfect alignment in OptiX. I would like to know how I can register several geometries (vertex buffers) in OptiX. Which is the better choice?

  • Create several buffers (vertex/index), one for each geometry (several buffers on the device)
  • Create a single buffer for vertex/index data and append all geometries to it

If the second method is the best, how can I adjust the values of the index buffer to match the appended vertex buffer?

vertex geometry1 ( v(0,0,0) v(-1,-1,0) v(1, -1, 0) )
index geometry1 (0, 1, 2)

vertex geometry2 ( v(0,0,10) v(-1,-1,10) v(1, -1, 10) )
index geometry2 (0, 1, 2)

In this case a single buffer would look like:

vertexbuffer ( v(0,0,0) v(-1,-1,0) v(1, -1, 0) v(0,0,10) v(-1,-1,10) v(1, -1, 10) )
indexbuffer ( 0, 1, 2, 3, 4, 5 )
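
For the single-buffer variant, the indices of each appended geometry have to be shifted by the number of vertices already in the merged vertex buffer. A minimal host-side sketch in C++, where `Position`, `MergedMesh`, and `appendGeometry` are hypothetical names and only the position is stored for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex type; a real vertex would carry all attributes.
struct Position {
    float x, y, z;
};

// Hypothetical container holding the merged vertex and index data.
struct MergedMesh {
    std::vector<Position>      vertices;
    std::vector<std::uint32_t> indices;
};

// Appends one geometry to the merged buffers and returns the position of its
// first vertex inside the merged vertex buffer (the "base vertex").
std::uint32_t appendGeometry(MergedMesh& mesh,
                             const std::vector<Position>& vertices,
                             const std::vector<std::uint32_t>& indices)
{
    const std::uint32_t baseVertex =
        static_cast<std::uint32_t>(mesh.vertices.size());

    mesh.vertices.insert(mesh.vertices.end(), vertices.begin(), vertices.end());

    for (std::uint32_t i : indices) {
        assert(i < vertices.size());            // index must be local to this geometry
        mesh.indices.push_back(baseVertex + i); // shift into the merged buffer
    }
    return baseVertex;
}
```

With the two triangles from the example, the second call returns a base vertex of 3, so the merged index buffer becomes 0, 1, 2, 3, 4, 5.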


I use multiple buffers (one per geometry). There is no reason to use a single buffer.

I’m not up to date with DirectX’s structure alignment handling, but if that structure is tightly packed, it won’t match what CUDA requires, and I wouldn’t be surprised if it didn’t work at all.

Please have a look at the CUDA alignment requirements. Look at Table 3, “Alignment Requirements in Device Code”, in the CUDA programming guide.

CUDA has load instructions for float, float2, and float4 vectors, but not for float3, which is handled as three separate floats.
This means loading a float4 is actually faster than loading a float3.

The alignment requirements are float4: 16 bytes, float2: 8 bytes, float: 4 bytes, which means float3 is also 4-byte aligned.
Structures in arrays of structures are padded so that the next structure starts on a 16-byte boundary; at least that was documented in some CUDA programming guide in the past. I can’t find that requirement in the current docs right now. Assume it’s 16 bytes when there is a float4 member in the struct.

This means that in a structure like

  float3 position; 
  float3 normal;
  float4 tangent;
  float2 texcoord;

there is actually invisible padding required before the float4 to give it 16-byte alignment, and padding after the float2 to move the next struct in an array onto a 16-byte aligned address. This is by no means a perfect match for a tightly packed buffer of floats.
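
The padding can be made visible on the host with a small C++ sketch. `Float2`, `Float3`, and `Float4` are hypothetical stand-ins for the CUDA vector types, assuming the alignments quoted above (8, 4, and 16 bytes); `PaddedVertex` and `TightVertex` are illustrative names as well:

```cpp
#include <cstddef>

// Stand-ins for the CUDA vector types with the alignments assumed above.
struct alignas(8)  Float2 { float x, y; };
struct             Float3 { float x, y, z; };
struct alignas(16) Float4 { float x, y, z, w; };

// Layout a compiler produces when it honors those alignments.
struct PaddedVertex {
    Float3 position;  // offset  0
    Float3 normal;    // offset 12
    Float4 tangent;   // offset 32, after 8 bytes of padding
    Float2 texcoord;  // offset 48
};                    // size 64, including 8 bytes of tail padding

// Tightly packed layout: 12 floats back to back.
struct TightVertex {
    float position[3];
    float normal[3];
    float tangent[4];
    float texcoord[2];
};

static_assert(offsetof(PaddedVertex, tangent) == 32, "float4 moves to a 16-byte boundary");
static_assert(offsetof(PaddedVertex, texcoord) == 48, "float2 follows the float4");
static_assert(sizeof(PaddedVertex) == 64, "struct size is rounded up to 16 bytes");
static_assert(sizeof(TightVertex) == 48, "packed layout has no padding");
```

Under these assumptions the aligned struct occupies 64 bytes per vertex while the packed data is 48 bytes, so reading a packed buffer through the aligned struct would pick up the wrong floats from the tangent onward.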

About interleaved and individual buffers:
The only way to access a single buffer element, whatever that is, including a user-defined struct, is operator[]. Pointer arithmetic is not allowed on buffers!
That operator needs to resolve the buffer element address from the buffer variable and the offset. This takes some instructions, which means it can be faster to have fewer buffers. I would not recommend putting each vertex attribute into its own buffer, simply for performance reasons.
I’m not sure at what buffer count the cost of the buffer address calculations outweighs the cost of non-coalesced memory accesses. Three or four buffers are probably not that bad. That would need to be measured per individual case.

There are multiple ways to handle this more efficiently:

1. Ugly, but access and memory efficient: Change the position and normal to float4 and move the two texcoord floats into the additional .w components of position and normal:

  float4 position_texcoordu;
  float4 normal_texcoordv;
  float4 tangent;

That way the alignment is automatically perfect to 16 bytes with no implicit padding.
Loads will be fast float4 instructions.

2. Efficient access, but using more memory: Make all your vertex attributes float4.
That’s what I’m doing in my renderers. I’m also using 3D texture coordinates or other data in the spare slots when I need more than texcoord.xy.

3. Move the vertex position into its own buffer and the other attributes into an array of structs (interleaved attributes) using the methods in 1. and 2. above.
The intersection program is the most often called program. The intersection check only needs the positions, and if you do not hit the primitive, you save the access to the other attributes. Position accesses might be more coalesced that way. That might even work fine with float3.
Normally the bounding box program also only uses the positions, unless you do some sort of displacement mapping, and doesn’t need to know about the other attributes’ buffer at all.
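
The first option above (texture coordinates moved into the .w components) could be prepared on the host roughly like this. This is only a sketch: `Float4` is a stand-in for the CUDA float4 with its 16-byte alignment, and `SourceVertex`, `PackedVertex`, and `packVertex` are hypothetical names:

```cpp
#include <cstddef>

// Stand-in for CUDA's float4 (16-byte aligned).
struct alignas(16) Float4 { float x, y, z, w; };

// Tightly packed source vertex as it comes from the application.
struct SourceVertex {
    float position[3];
    float normal[3];
    float tangent[4];
    float texcoord[2];
};

// Three float4 per vertex, no implicit padding.
struct PackedVertex {
    Float4 position_texcoordu;  // xyz = position, w = texcoord u
    Float4 normal_texcoordv;    // xyz = normal,   w = texcoord v
    Float4 tangent;             // unchanged
};

static_assert(sizeof(PackedVertex) == 48, "three float4, nothing wasted");

// Repacks one vertex; constexpr so the mapping can be checked at compile time.
constexpr PackedVertex packVertex(const SourceVertex& v)
{
    return PackedVertex{
        { v.position[0], v.position[1], v.position[2], v.texcoord[0] },
        { v.normal[0],   v.normal[1],   v.normal[2],   v.texcoord[1] },
        { v.tangent[0],  v.tangent[1],  v.tangent[2],  v.tangent[3]  }
    };
}
```

The packed struct is the same 48 bytes as the original 12 floats, but every member starts on a 16-byte boundary. The price is that the data has to be repacked once, so the DirectX vertex buffer can no longer be shared unchanged.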

I ran several tests with memory alignment because I want to use interop with the DirectX 11 VertexBuffer/IndexBuffer. After many attempts, this struct works fine.

Sorry, I did not specify which side the struct layout was for. You’re right.

Struct on the host side (DirectX struct):
XMFLOAT3 position;
XMFLOAT3 normal;
XMFLOAT4 tangent;
XMFLOAT2 texcoord;

Struct on the device side (CUDA):
struct Vertex
{
  float3 Position; // align(4)
  float3 Normal;   // align(4)
  float3 Tangent;  // align(4)
  float2 TexCoord; // align(8)
};

The tangent is 4 floats on the host and 3 floats on the device. Using the same layout for DirectX and OptiX is complicated, but the advantage is that I do not need to reload the geometries for OptiX.
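
One plausible explanation for why this pairing works can be checked on the host. The sketch below assumes the CUDA alignments discussed earlier (float3: 4 bytes, float2: 8 bytes); `Float2`, `Float3`, `HostVertex`, and `DeviceVertex` are hypothetical stand-ins, with the XMFLOAT members modeled as plain float arrays:

```cpp
#include <cstddef>

// Stand-ins for the CUDA vector types with the assumed alignments.
struct alignas(8) Float2 { float x, y; };
struct            Float3 { float x, y, z; };

// Host layout: XMFLOAT3/XMFLOAT4/XMFLOAT2 are plain floats, tightly packed.
struct HostVertex {
    float position[3];  // offset  0
    float normal[3];    // offset 12
    float tangent[4];   // offset 24
    float texcoord[2];  // offset 40
};

// Device layout from the post, with a float3 tangent.
struct DeviceVertex {
    Float3 Position;  // offset  0
    Float3 Normal;    // offset 12
    Float3 Tangent;   // offset 24, ends at 36
    Float2 TexCoord;  // 8-byte alignment pushes this to offset 40
};

static_assert(sizeof(HostVertex) == 48, "12 packed floats");
static_assert(sizeof(DeviceVertex) == 48, "same stride as the host struct");
static_assert(offsetof(HostVertex, tangent) == offsetof(DeviceVertex, Tangent),
              "tangent starts at the same byte");
static_assert(offsetof(HostVertex, texcoord) == offsetof(DeviceVertex, TexCoord),
              "texcoord starts at the same byte");
```

Under these assumptions the 4 bytes of padding in front of TexCoord line up exactly with tangent.w on the host, so both structs have the same 48-byte stride and the same member offsets. The fourth tangent component is simply not accessible through this device struct.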