Why float4 in VBO generation kernels?

OK, so a lot of the NVIDIA CUDA tutorials use float4 for vertices and normals, setting w to 1 and 0 respectively. While I understand the reasons behind that, I wonder whether, for large meshes, memory could be saved by using float3 instead, since the OpenGL API supports it.

My question is: if I use float3 instead, will the graphics hardware still expand each vertex to a 4-tuple to handle the 4x4 transformations? In other words, will I save any CUDA memory by changing my VBO generation code to use float3? And will it come with a performance penalty?

The only drawback to float3 is that compute capability 1.0 and 1.1 GPUs cannot do coalesced reads of this data type. The memory controller on those older chips issues many uncoalesced transactions when a half-warp accesses a row of float3 elements, because each thread's 32-bit loads land 12 bytes apart instead of at consecutive addresses. There are tricks to work around this, like casting the pointer to float* and reading the floats into a shared memory buffer, but the easiest approach is to find a compute capability 1.2 or 1.3 card and not worry about it. :)
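For anyone stuck on a 1.0/1.1 card, the shared-memory workaround mentioned above might look roughly like the sketch below. Everything named here is made up for illustration: the kernel scaleFloat3Staged, the host helper scaleVertices, the block size of 256, and the "scale each vertex" operation standing in for whatever the real kernel does. The idea is just that the global loads and stores go through a float* view of the buffer, so consecutive threads touch consecutive 32-bit words, and the per-vertex work happens on the copy in shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // threads per block (hypothetical choice)

// Scales every float3 vertex by s. Global memory is only touched through a
// float* view, so thread t reads/writes floats t, t+BLOCK, t+2*BLOCK of the
// block's tile: consecutive threads hit consecutive 32-bit words.
__global__ void scaleFloat3Staged(float3* verts, float s, int n)
{
    __shared__ float tile[BLOCK * 3];

    float* flat = reinterpret_cast<float*>(verts);  // same buffer, seen as floats
    int base  = blockIdx.x * BLOCK * 3;             // first float owned by this block
    int total = n * 3;                              // number of floats in the buffer

    // Staged load into shared memory.
    for (int k = 0; k < 3; ++k) {
        int i = threadIdx.x + k * BLOCK;
        if (base + i < total) tile[i] = flat[base + i];
    }
    __syncthreads();

    // Per-vertex work on the shared-memory copy (placeholder operation).
    int v = blockIdx.x * BLOCK + threadIdx.x;
    if (v < n) {
        tile[3 * threadIdx.x + 0] *= s;
        tile[3 * threadIdx.x + 1] *= s;
        tile[3 * threadIdx.x + 2] *= s;
    }
    __syncthreads();

    // Staged store back to global memory, same access pattern as the load.
    for (int k = 0; k < 3; ++k) {
        int i = threadIdx.x + k * BLOCK;
        if (base + i < total) flat[base + i] = tile[i];
    }
}

// Hypothetical host helper: copies the vertices to the device, runs the
// kernel, and copies the result back. Returns false on any CUDA error.
bool scaleVertices(std::vector<float3>& verts, float s)
{
    int n = static_cast<int>(verts.size());
    if (n == 0) return true;

    size_t bytes = n * sizeof(float3);
    float3* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    if (cudaMemcpy(d, verts.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    int blocks = (n + BLOCK - 1) / BLOCK;
    scaleFloat3Staged<<<blocks, BLOCK>>>(d, s, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(verts.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return err == cudaSuccess;
}
```

The host helper is only there so the sketch can be run on its own, e.g. scaleVertices(verts, 2.0f) on a std::vector<float3>. In a VBO generation kernel you would presumably launch the kernel on the buffer pointer you already have rather than copying through a vector.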

(Edit: Oops, I missed that you were asking about OpenGL interop. My statement still stands for CUDA in general, but there may be further issues I don’t know about.)