I’m not up to date with DirectX’s structure alignment handling, but if that structure is tightly packed, it won’t match what CUDA requires, and I wouldn’t be surprised if it didn’t work at all.
Please have a look at the CUDA alignment requirements. See Table 3, “Alignment Requirements in Device Code”, here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types
CUDA has load instructions for float, float2, and float4 vectors, but not for float3, which is handled as three floats.
That means loading a float4 is actually faster than loading a float3.
The alignment requirements are float4: 16 bytes, float2: 8 bytes, float: 4 bytes, which means float3 is also 4-byte aligned.
Structures in arrays of structures are aligned so that the next structure starts on a 16-byte boundary; at least that was documented in some CUDA programming guide in the past. I can’t find that requirement in the current docs right now. Assume it’s 16 bytes when there is a float4 member in the struct.
That means a structure with a float4 member and a trailing float2 actually needs invisible padding: some in front of the float4 so it can be accessed with 16-byte alignment, and some after the float2 so the next struct in an array starts on a 16-byte aligned address. This is by no means a perfect match to a tightly packed buffer of floats.
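To make that padding visible, here is a minimal host-side sketch. It is not CUDA code: F2, F3, and F4 are hypothetical stand-ins that only mimic the alignments listed above for float2, float3, and float4, and PaddedVertex with its member names is an invented example struct, not the one from your code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins mimicking the documented CUDA alignments
// (the real types come from the CUDA headers).
struct             F3 { float x, y, z; };     // like float3: 4-byte aligned, 12 bytes
struct alignas(8)  F2 { float x, y; };        // like float2: 8-byte aligned
struct alignas(16) F4 { float x, y, z, w; };  // like float4: 16-byte aligned

// Invented example struct, for illustration only.
struct PaddedVertex {
    F3 position;  // offset 0, 12 bytes
                  // 4 bytes of invisible padding so the F4 lands on offset 16
    F4 color;     // offset 16, 16 bytes
    F2 texcoord;  // offset 32, 8 bytes
                  // 8 bytes of tail padding so the next array element is 16-byte aligned
};

static_assert(alignof(F3) == 4 && alignof(F2) == 8 && alignof(F4) == 16,
              "stand-ins follow the alignment table");
static_assert(offsetof(PaddedVertex, color) == 16, "padding before the float4");
static_assert(offsetof(PaddedVertex, texcoord) == 32, "float2 follows directly");
static_assert(sizeof(PaddedVertex) == 48, "array stride is a multiple of 16");

// Nine tightly packed floats would occupy 36 bytes, so 12 bytes are padding.
constexpr std::size_t tightly_packed_bytes = 9 * sizeof(float);
constexpr std::size_t padding_bytes = sizeof(PaddedVertex) - tightly_packed_bytes;
```

A buffer filled with 36-byte vertices on the graphics API side would therefore be read with a 48-byte stride on the CUDA side, and every element after the first would be misinterpreted.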
About interleaved and individual buffers:
The only way to access a single buffer element, whatever that is, including a user-defined struct, is the operator[]. Pointer arithmetic is not allowed on buffers!
That operator needs to resolve the buffer element address from the buffer variable and offset. This takes some instructions, which means it can be faster to have fewer buffers. I would not recommend putting each vertex attribute into its own buffer, simply for performance reasons.
I’m not sure at what buffer count the cost of the buffer address calculations outweighs the cost of non-coalesced memory accesses. Three or four buffers are probably not that bad. That would need to be measured per individual case.
There are multiple ways to handle this more efficiently:
1. Ugly but access and memory efficient: Change the position and normal to float4 and move the two texcoord floats into the additional .w components of position and normal.
   That way the alignment is automatically perfect at 16 bytes with no implicit padding.
   Loads will be fast float4 instructions.
2. Efficient access but using more memory: Make all your vertex attributes float4.
   That’s what I’m doing in my renderers. I’m also using 3D texture coordinates or other data in the texcoord.zw slots when not using more than texcoord.xy.
3. Move the vertex position into its own buffer and the other attributes into an array of structs (interleaved attributes) using the methods in 1. and 2. above.
   The intersection program is the most often called program. The intersection check only needs the positions, and if you do not hit, you save the access to the other attributes. Position accesses might be more coalesced that way. That might even work fine with float3.
   Normally the bounding box program also only uses the positions, unless you do some sort of displacement mapping, and doesn’t need to know the other attributes’ buffer at all.
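The three layouts can be sketched roughly like this. Again this is host-side C++ with hypothetical stand-in types F3 and F4 in place of float3 and float4; the struct names and the pack_vertex helper are made up for illustration, and which attribute goes into which slot is just one possible choice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for CUDA's float3 and float4.
struct             F3 { float x, y, z; };
struct alignas(16) F4 { float x, y, z, w; };

// Option 1: texcoord hidden in the .w components, 32 bytes, no padding.
struct VertexPacked {
    F4 position;  // .w carries texcoord.x
    F4 normal;    // .w carries texcoord.y
};

// Option 2: every attribute is a float4, 48 bytes, no padding.
struct VertexWide {
    F4 position;
    F4 normal;
    F4 texcoord;  // .zw are free for a 3D texcoord or other data
};

// Option 3: positions live in their own buffer; only the remaining
// attributes are interleaved (shown here in the style of option 2).
struct VertexAttributes {
    F4 normal;
    F4 texcoord;
};

static_assert(sizeof(VertexPacked) == 32, "two float4 loads per vertex");
static_assert(sizeof(VertexWide) == 48, "three float4 loads per vertex");
static_assert(sizeof(VertexAttributes) == 32, "position is fetched separately");

// Hypothetical helper that builds the option 1 layout from separate attributes.
inline VertexPacked pack_vertex(F3 p, F3 n, float u, float v) {
    return VertexPacked{ F4{p.x, p.y, p.z, u}, F4{n.x, n.y, n.z, v} };
}
```

With option 1 the shading code reads the texture coordinates back from position.w and normal.w, so it has to take care not to treat those components as part of the vectors.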