You still get RTX hardware traversal even when you don’t use the hardware triangle intersection, so yes, you are using the RT cores, and the BVH format is still hardware specific.
There are no guarantees on either the maximum number of children per leaf or the traversal order of leaf nodes. Note that the traversal order of leaf nodes is always ambiguous with respect to hit distance, even if the leaves were traversed in a known order. Because leaf bounds can overlap, for a given ray you can always arrange the primitives in two leaf nodes such that one leaf is traversed before the closest intersection point, which lies in the other leaf, is found.
The buffer size of 32 in optixParticleVolumes is only a convenient number; it is not related in any way to the number of children per leaf node. It’s just a good power of two for doing the parallel bitonic sort, and it’s large enough to capture enough transparent particles in most cases that you don’t see any artifacts.
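To illustrate why a power of two is convenient here, below is a minimal sketch of a bitonic sorting network over a 32-entry hit buffer. This is not the optixParticleVolumes code: `HitSample`, `kBufferSize`, and `bitonicSortByT` are hypothetical names, and the network runs serially on the host, whereas on the GPU each index `i` of the inner loop would typically be handled in parallel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical record: one buffered particle hit along the ray.
struct HitSample {
    float t;   // ray parameter of the hit
    int   id;  // particle index
};

constexpr std::size_t kBufferSize = 32;  // assumed buffer size; must be a power of two

// Sorts the buffer by ascending t using a bitonic sorting network.
// The compare/exchange pattern depends only on the indices, never on the data,
// which is what makes it easy to run in parallel.
inline void bitonicSortByT(std::array<HitSample, kBufferSize>& buf) {
    static_assert((kBufferSize & (kBufferSize - 1)) == 0,
                  "bitonic network needs a power-of-two size");
    for (std::size_t k = 2; k <= kBufferSize; k <<= 1) {       // size of the bitonic runs
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {         // compare distance
            for (std::size_t i = 0; i < kBufferSize; ++i) {    // one "lane" per element
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;                    // handle each pair once
                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending ? (buf[i].t > buf[partner].t)
                                                  : (buf[i].t < buf[partner].t);
                if (outOfOrder) std::swap(buf[i], buf[partner]);
            }
        }
    }
}
```

If fewer than 32 hits are collected, one option is to pad the unused slots with a very large `t` so they sort to the end.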
I’m curious what decisions you could make to speed things up if you could count the children in a leaf node. If your use case is general, we will certainly consider making adjustments to support what you need.
If you want to guarantee you have exactly 8, or no more than 8, children in a leaf node, here is what I would recommend trying: do a single-pass clustering algorithm of your own and write a custom intersection program that handles 8 children at a time. This way you control what you consider a leaf node and what’s inside it, and you can still use the RT cores in a way that is agnostic to the BVH implementation.
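As a rough host-side sketch of that clustering step, the code below orders primitives along a Morton curve and then makes a single pass that cuts the ordered list into groups of at most 8, computing one bounding box per group. Everything here is an assumption for illustration: the sphere primitive, the Morton ordering, and the names `Sphere`, `Cluster`, and `buildClusters` are not from the SDK. Each resulting box would then be submitted as one custom primitive, and the custom intersection program would loop over that cluster's children.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3   { float x, y, z; };
struct Sphere { Vec3 center; float radius; };  // hypothetical primitive type
struct Aabb   { Vec3 lo, hi; };
struct Cluster {
    Aabb bounds;          // box to hand to the BVH build as one custom primitive
    std::uint32_t first;  // index of the first child in the reordered array
    std::uint32_t count;  // number of children, 1..kMaxPerCluster
};

constexpr std::size_t kMaxPerCluster = 8;

// Spreads the low 10 bits of v so that two zero bits separate each original bit.
inline std::uint32_t expandBits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Quantizes one coordinate to 10 bits within [lo, hi].
inline std::uint32_t quantize(float c, float lo, float hi) {
    const float n = (hi > lo) ? (c - lo) / (hi - lo) : 0.0f;
    return static_cast<std::uint32_t>(std::min(std::max(n * 1024.0f, 0.0f), 1023.0f));
}

// Reorders prims spatially, then chunks them into clusters of at most 8.
inline std::vector<Cluster> buildClusters(std::vector<Sphere>& prims) {
    std::vector<Cluster> clusters;
    if (prims.empty()) return clusters;

    // Bounds of the centers, used to normalize coordinates for the Morton code.
    Vec3 lo = prims[0].center, hi = prims[0].center;
    for (const Sphere& s : prims) {
        lo = {std::min(lo.x, s.center.x), std::min(lo.y, s.center.y), std::min(lo.z, s.center.z)};
        hi = {std::max(hi.x, s.center.x), std::max(hi.y, s.center.y), std::max(hi.z, s.center.z)};
    }
    auto code = [&](const Sphere& s) {
        return (expandBits(quantize(s.center.x, lo.x, hi.x)) << 2) |
               (expandBits(quantize(s.center.y, lo.y, hi.y)) << 1) |
                expandBits(quantize(s.center.z, lo.z, hi.z));
    };
    // Recomputing the code in the comparator keeps the sketch short; cache it in real code.
    std::sort(prims.begin(), prims.end(),
              [&](const Sphere& a, const Sphere& b) { return code(a) < code(b); });

    // Single pass: every run of up to 8 consecutive primitives becomes one cluster.
    for (std::size_t first = 0; first < prims.size(); first += kMaxPerCluster) {
        const std::size_t count = std::min(kMaxPerCluster, prims.size() - first);
        Aabb box{{1e30f, 1e30f, 1e30f}, {-1e30f, -1e30f, -1e30f}};
        for (std::size_t i = first; i < first + count; ++i) {
            const Sphere& s = prims[i];
            box.lo = {std::min(box.lo.x, s.center.x - s.radius),
                      std::min(box.lo.y, s.center.y - s.radius),
                      std::min(box.lo.z, s.center.z - s.radius)};
            box.hi = {std::max(box.hi.x, s.center.x + s.radius),
                      std::max(box.hi.y, s.center.y + s.radius),
                      std::max(box.hi.z, s.center.z + s.radius)};
        }
        clusters.push_back({box, static_cast<std::uint32_t>(first),
                            static_cast<std::uint32_t>(count)});
    }
    return clusters;
}
```

With this layout the intersection program only needs the cluster's `first` and `count` to find its children, and the last cluster may hold fewer than 8. Pad it if you need exactly 8 everywhere.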