Good question. Is the content of R0 guaranteed to be sufficiently aligned? Are you using any instructions that prevent R31 and R32 from being interchanged (texture instructions come to mind)?
If you use float2 (instead of float) then LD.64 instructions are used, but if you use
typedef struct {float x, y;} f2t;
then LD.64 is NOT used! That is wrong! There should be no difference between float2 and f2t. So it looks like a hack: to fix the bug that tera links to, someone just went in there and added some hack code which ONLY fixes built-in types (like float2) and absolutely nothing else. :(
In fact, as the original post suggests, it seems even ordinary back-to-back float copies should be coalesced into LD.64.
Data on GPUs needs to be accessed using the natural alignment of the data, that is, words need to be aligned on a word boundary, double words need to be aligned on a double-word boundary, and so on. Unaligned accesses lead to undefined behavior.
In general, a struct with two floats guarantees only 4-byte alignment, since the size of a float is 4 bytes. A float2, however, has guaranteed 8-byte alignment. This allows the compiler to safely generate the wider (64-bit) load instruction for a float2, while the struct of two floats needs to be accessed using two 32-bit loads, unless the compiler can prove the required alignment for the wider access by other means (it sometimes can do so when spilling registers, for example).
Programmers can force alignment of data with attributes, and as far as I know this is also how the float2 type is implemented. See the following section in the CUDA C Programming Guide: 5.3.2.1.1 Size and Alignment Requirement
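To make the alignment difference concrete, here is a minimal sketch, not taken from the Programming Guide. The names ALIGN8, f2t_aligned and is_aligned_for_64bit_access are hypothetical and introduced only for illustration. It assumes a typical platform where float is 4 bytes. Under nvcc the CUDA specifier __align__(8) is used; in a host-only build the standard C++11 alignas(8) stands in for it.

```cpp
#include <cstdint>

// ALIGN8 is a hypothetical helper macro: __align__(8) when compiled by nvcc,
// standard alignas(8) otherwise, so the sketch also builds as plain C++11.
#if defined(__CUDACC__)
#define ALIGN8 __align__(8)
#else
#define ALIGN8 alignas(8)
#endif

// The struct from the post above: alignment follows its widest member (float).
typedef struct { float x, y; } f2t;

// Hypothetical variant with the alignment forced to 8 bytes.
struct ALIGN8 f2t_aligned { float x, y; };

// Same size, different alignment guarantee (assuming a 4-byte float).
static_assert(sizeof(f2t) == sizeof(f2t_aligned), "both hold two floats");
static_assert(alignof(f2t) == alignof(float), "plain struct: float alignment only");
static_assert(alignof(f2t_aligned) == 8, "forced 8-byte alignment");

// Returns true when the address is a multiple of 8, which is the
// precondition for a single 64-bit access.
inline bool is_aligned_for_64bit_access(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}
```

Every element of an array of f2t_aligned passes the address check, whereas an array of f2t may start on an address that is only a multiple of 4, so nothing in the type lets the compiler assume more.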
Nvidia / njuffa, I stand corrected. Everything seems to work just great, exactly as you say. However, I did not find __attribute__((aligned(…))) in the guide. Is it documented somewhere?
Please refresh your browser :-) __attribute__((aligned)) is how one would do this with gcc; in CUDA there is __align__. Within 10 minutes of posting I corrected my forum post to remove the erroneous portion and point at the relevant section in the Programming Guide. Please see the edited post above.