Preferred alignment for buffers

Some of the API types have documented alignment requirements, but for those that don’t, e.g. vertex and index data, output colour buffers etc., is there a preferred alignment requirement for performance?

Yes, and it depends on your data formats.
All device memory allocations must adhere to the CUDA alignment rules for the built-in types, otherwise it won’t work at all, but you’ll get a CUDA misaligned address error instead.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types__alignment-requirements-in-device-code

CUDA Driver API cuMemAlloc() resp. CUDA Runtime API cudaMalloc() say “The allocated memory is suitably aligned for any kind of variable.” Those should return 256-byte aligned addresses.
Because of that, an arena allocator might come in handy for lots of small device memory allocations, also for performance.

Note that only 1-, 2-, and 4-component vectors have vectorized load and store commands. Means reading a float4 is normally faster than reading a float3 which is effectively treated as three floats.

The recommendation to not accumulate (read-modify-write) into float3 when using pinned host memory given in previous OptiX versions esp. from multiple devices still applies. That’s a PCI-E effect.

Thanks Detlef! Yeah I’m in the process of adding allocator support to my high-level optix bindings, hence the question.

As a follow-up, what exactly is special about the built-in vector types aside from the alignment? You mention vectorized load/store… is it possible to get the same efficiency with a custom vector type?

“As a follow-up, what exactly is special about the built-in vector types aside from the alignment?”

There are matching PTX specifiers for the vectorized instructions. You can find them inside the PTX ISA documentation in the CUDA Toolkit docs folder, e.g. ptx_isa_6.4.pdf in CUDA 10.1, when looking for “v2” and “v4” for load and store (ld, st), chapter 9.7.8 Data Movement and Conversion Instructions.

“You mention vectorized load/store… is it possible to get the same efficiency with a custom vector type?”

Unlikely, at least not without explicit design for that purpose.
The different alignment restrictions of a float4 (align 16) and four floats (align 4) would prohibit automatic compile time optimizations for vectorized accesses.
You would either need to code especially for the built-in vector types to indicate the alignment requirements are granted during compile time or you’d need to strictly align all your user defined structs to the same alignment the CUDA compiler uses to optimize the data accesses.
Check the CUDA header vector_types.h for how that defines the float4 for example.

Using the align specifiers on user defined structs are hopefully enough to let the compiler generate more optimal accesses. (I haven’t tried that, I use the built-in types.)

PTX is just an intermediate assembly-like representation. How the actual microcode looks like in the end, can only be seen in the final SASS code.

With PTX input code compiled with --generate-line-info and OptixModuleCompileOptions::debugLevel set OPTIX_COMPILE_DEBUG_LEVEL_LINEINFO, Nsight Compute 2019.4 or newer should be able to show the OptiX 7 CUDA source code, the PTX code, and the matching SASS instructions for the user provided modules side by side.
That will really tell you if the compiler generated vectorized instructions.

Here’s a list of SASS instructions for the different GPU architectures.
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref

(Warning: Never profile with Nsight Compute when OptixModuleCompileOptions and OptixPipelineCompileOptions are not at full optimizations!
With OPTIX_COMPILE_DEBUG_LEVEL_FULL there, the assembly will look completely differently, esp. wrt. to instruction reordering.
The bottlenecks will become different, for example, where a path tracer is normally memory access limited, in debug it might be instruction fetch limited due to missing interleaving of parallel instructions, etc.)