After trying everything I could think of in CUDA 0.9, I am left with one problem: how do I stop the compiler from putting a float4 struct in local memory (lmem), when the whole point of using it was to get some level of coalescing while sequentially scanning a large matrix transpose? Even with a very big maxrregcount, the resulting cubin still has the offending auto float4 in lmem. Simple copy examples seem to work, but as soon as you do some computation on all the individual elements, off it goes to lmem.
The generated code for the read does a v4.f32 load (which is 1/3 coalesced on the 8800, 100% on the 8600), then immediately does 4 stores to local, each with a stride of 16 bytes and so only 1/4 coalesced. After that, every reference to an element of the float4 does a read from local, again with a stride of 16 bytes, which is only 1/4 coalesced. Looking at the cubin, all of these instructions really do seem to still be there.
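To make the pattern concrete, here is a minimal sketch of the kind of kernel described above: one float4 load per thread, followed by arithmetic on all four components. It is not the original code; the names `sumSquares` and `runSumSquares` and the test data are made up for illustration. The scalar unpacking right after the load is only something to try, on the assumption that it makes it easier for the compiler to keep the values in registers. Whether it actually keeps the float4 out of lmem under 0.9 has to be checked in the cubin.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel (not the original code): one float4 load per thread,
// followed by arithmetic that touches all four components.
__global__ void sumSquares(const float4 *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = in[i];   // one 16-byte vector load from global memory

    // Unpack into plain scalars right away so later code never touches the struct.
    // Assumption: this may help the compiler keep the values in registers.
    float x = v.x, y = v.y, z = v.z, w = v.w;

    out[i] = x * x + y * y + z * z + w * w;
}

// Runs the kernel on n elements and returns the number of mismatches against
// a host reference, or -1 if a CUDA call fails.
int runSumSquares(int n)
{
    float4 *hIn = (float4 *)malloc(n * sizeof(float4));
    float *hOut = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        hIn[i] = make_float4((float)(i % 5), 1.0f, 2.0f, 3.0f);

    float4 *dIn = 0;
    float *dOut = 0;
    int bad = -1;
    if (cudaMalloc((void **)&dIn, n * sizeof(float4)) == cudaSuccess &&
        cudaMalloc((void **)&dOut, n * sizeof(float)) == cudaSuccess) {
        cudaMemcpy(dIn, hIn, n * sizeof(float4), cudaMemcpyHostToDevice);
        const int threads = 256;
        sumSquares<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, n);
        if (cudaMemcpy(hOut, dOut, n * sizeof(float),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            bad = 0;
            for (int i = 0; i < n; ++i) {
                float r = (float)(i % 5);
                // 1*1 + 2*2 + 3*3 = 14; small integers, so the sum is exact in float
                if (hOut[i] != r * r + 14.0f) ++bad;
            }
        }
    }
    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return bad;
}

int main()
{
    int bad = runSumSquares(1 << 16);
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

The host code only checks that the kernel computes the right values. It says nothing about where the compiler placed `v`, so the cubin still needs to be inspected for lmem usage after each change.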
I even tried the “register” keyword, but that is probably just deleted by cpp these days.
Also, I wonder why everyone gets half the performance reading float4s compared to floats?
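One way to put a number on the float versus float4 question on a given card is to time two plain copy kernels that move the same number of bytes. The following is a rough sketch under some assumptions: the kernel names, the helper `timeCopyMs`, and the buffer size are hypothetical, and the event timing calls are from the current CUDA runtime API and have not been checked against 0.9. A single timed launch is also a crude measurement, so treat the figures as indicative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one float (4 bytes).
__global__ void copyFloat(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Each thread copies one float4 (16 bytes).
__global__ void copyFloat4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Launches the chosen copy kernel for the given element count.
static void launchCopy(bool useFloat4, const float *in, float *out, int elems)
{
    const int threads = 256;
    int blocks = (elems + threads - 1) / threads;
    if (useFloat4)
        copyFloat4<<<blocks, threads>>>((const float4 *)in, (float4 *)out, elems);
    else
        copyFloat<<<blocks, threads>>>(in, out, elems);
}

// Times one copy of nFloats floats (nFloats must be a multiple of 4).
// Returns elapsed milliseconds, or -1 if a CUDA call fails.
float timeCopyMs(bool useFloat4, int nFloats)
{
    size_t bytes = (size_t)nFloats * sizeof(float);
    float *in = 0, *out = 0;
    if (cudaMalloc((void **)&in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&out, bytes) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, bytes);

    int elems = useFloat4 ? nFloats / 4 : nFloats;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launchCopy(useFloat4, in, out, elems);   // warm-up launch, not timed

    cudaEventRecord(start, 0);
    launchCopy(useFloat4, in, out, elems);   // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ok ? ms : -1.0f;
}

int main()
{
    const int nFloats = 1 << 22;   // 16 MB per buffer, same bytes for both kernels
    printf("float  copy: %f ms\n", timeCopyMs(false, nFloats));
    printf("float4 copy: %f ms\n", timeCopyMs(true, nFloats));
    return 0;
}
```

Both kernels read and write the same total number of bytes, so any difference in the reported times comes from the access width rather than the amount of data moved.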
Edit: This gets even stranger. If I add a vector store that writes the same data back to the original location at the end of the block of code that used the 4 components, the store is done as a vector store from the $f registers that were loaded individually from local in the segment above. So there is absolutely no reason to put the data out to local: it does not save a single register, yet the compiler insists on doing it?