After trying everything I could think of in CUDA 0.9, I am left with the problem of how to stop the compiler from putting a float4 struct in local memory, when the whole point of using it was to get some level of coalescing while sequentially scanning a large matrix transpose. Even with a very big maxrregcount, the resulting cubin still has the offending auto float4 in lmem. Simple copy examples seem to work, but as soon as you do some computation on all the individual elements, off it goes to lmem.
The resulting code for the read does a v4.f32 read (which is 1/3 coalesced on the 8800, 100% on the 8600), then immediately does 4 stores to local, each with a stride of 16 bytes and so only 1/4 coalesced. After that, every reference to an element of the float4 does a read with a stride of 16 bytes, which is again only 1/4 coalesced. The cubin sure looks like all these instructions are still there.
I even tried the “register” keyword, but that is probably deleted by cpp these days.
Thanks,
Eric
Also, I wonder why everyone gets half the performance reading float4s compared to floats?
Edit: This gets even stranger. If I put a vector store of the same data back to the original location at the end of the block of my code that uses the 4 components, then the store is done as a vector store from the $f registers that were picked up individually from local in that block. So there is absolutely no reason to put the data out to local: it does not save a single register, yet the compiler insists upon doing it.
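For anyone who wants to reproduce this, the pattern described in this post might look roughly like the sketch below. The kernel name `scanFloat4`, the host helper `runScanFloat4`, the launch configuration and the arithmetic are all assumptions made for illustration, not the original code. Compiling it with `nvcc -ptx` or `nvcc -cubin` and looking for local-memory usage should show whether a given toolkit version spills `v`.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only: the kernel name, parameters and arithmetic are
// assumptions, not the code from the post.
__global__ void scanFloat4(float4* data, float* result, int count)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;

    for (int i = tid; i < count; i += stride) {
        float4 v = data[i];   // intended as a single vector load
        // Arithmetic on every individual component; the post reports that this
        // kind of use is what sent the float4 to local memory under 0.9.
        acc += v.x + 2.0f * v.y + 3.0f * v.z + 4.0f * v.w;
        data[i] = v;          // vector store of the same data, as in the edit
    }
    result[tid] = acc;        // one partial result per thread
}

// Hypothetical host helper: fills `count` float4s with ones, runs the kernel
// and returns the sum of all per-thread results.
float runScanFloat4(int count)
{
    const int threads = 64, blocks = 2, total = threads * blocks;

    float4* hData = new float4[count];
    for (int i = 0; i < count; ++i)
        hData[i] = make_float4(1.0f, 1.0f, 1.0f, 1.0f);

    float4* dData   = 0;
    float*  dResult = 0;
    cudaMalloc((void**)&dData, count * sizeof(float4));
    cudaMalloc((void**)&dResult, total * sizeof(float));
    cudaMemcpy(dData, hData, count * sizeof(float4), cudaMemcpyHostToDevice);

    scanFloat4<<<blocks, threads>>>(dData, dResult, count);

    float hResult[total];
    cudaMemcpy(hResult, dResult, total * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int i = 0; i < total; ++i)
        sum += hResult[i];

    cudaFree(dData);
    cudaFree(dResult);
    delete[] hData;
    return sum;
}
```

Each float4 of ones contributes 1 + 2 + 3 + 4 = 10 to the total, so `runScanFloat4(1000)` can be checked against `10000.0f`. Error checking on the CUDA calls is left out to keep the sketch short.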
In my experience with CUDA, I have also found that using the float4/float3/… types isn't such a good idea. Sometimes they even result in buggy behaviour, and since the G80 is a scalar processor they do not give you any gain. You could maybe work around this by using an array of floats instead of a structure.
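One way to read that suggestion is to address the same buffer as plain floats, so that each load is a scalar 32-bit read and consecutive threads touch consecutive floats. The sketch below is only an illustration under that reading: `sumSquaresScalar`, `runSumSquaresScalar` and the launch configuration are hypothetical names and values, and whether this is faster than the float4 version on a given card would have to be measured.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only: reads the data as individual floats instead of
// float4. The kernel name and the computation are assumptions.
__global__ void sumSquaresScalar(const float* in, float* out, int count)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;

    // Thread t reads elements t, t + stride, t + 2*stride, ... so that
    // neighbouring threads read neighbouring floats on every iteration.
    for (int i = tid; i < count; i += stride) {
        float x = in[i];      // scalar load, no struct involved
        acc += x * x;
    }
    out[tid] = acc;           // one partial result per thread
}

// Hypothetical host helper: fills `count` floats with 2.0f, runs the kernel
// and returns the sum of the per-thread partial results.
float runSumSquaresScalar(int count)
{
    const int threads = 64, blocks = 2, total = threads * blocks;

    float* hIn = new float[count];
    for (int i = 0; i < count; ++i)
        hIn[i] = 2.0f;

    float* dIn  = 0;
    float* dOut = 0;
    cudaMalloc((void**)&dIn, count * sizeof(float));
    cudaMalloc((void**)&dOut, total * sizeof(float));
    cudaMemcpy(dIn, hIn, count * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresScalar<<<blocks, threads>>>(dIn, dOut, count);

    float hOut[total];
    cudaMemcpy(hOut, dOut, total * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int i = 0; i < total; ++i)
        sum += hOut[i];

    cudaFree(dIn);
    cudaFree(dOut);
    delete[] hIn;
    return sum;
}
```

With every input equal to 2.0f, each element contributes 4, so `runSumSquaresScalar(4096)` can be checked against `16384.0f`.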
In the process of preparing the repro I found the exact trigger for my problem, so I will document it here rather than through a bug report, as it may help another user immediately.
With the declaration “float4* pf4”, ALL cases of read and increment of pf4 work EXCEPT “pf4++”, which triggers the problem (“++pf4”, “pf4[0]; ++pf4;” etc. all work OK).
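To make the two cases concrete, here is a hedged sketch with one kernel per variant so their PTX or cubin output can be compared side by side. It assumes the failing read was written as `*pf4++`; the kernel names, the helper `runIncrementVariant` and the arithmetic are hypothetical. Both kernels compute the same result, so any difference should show up only in the generated code.

```cuda
#include <cuda_runtime.h>

// Variant reported to trigger the spill in 0.9 (assumed here to be `*pf4++`).
__global__ void sumPostInc(const float4* in, float* out, int perThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const float4* pf4 = in + tid * perThread;
    float acc = 0.0f;
    for (int i = 0; i < perThread; ++i) {
        float4 v = *pf4++;    // read and post-increment in one expression
        acc += v.x + 2.0f * v.y + 3.0f * v.z + 4.0f * v.w;
    }
    out[tid] = acc;
}

// Variant reported to work: read first, then pre-increment separately.
__global__ void sumPreInc(const float4* in, float* out, int perThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const float4* pf4 = in + tid * perThread;
    float acc = 0.0f;
    for (int i = 0; i < perThread; ++i) {
        float4 v = pf4[0];    // read through the pointer
        ++pf4;                // advance it as a separate statement
        acc += v.x + 2.0f * v.y + 3.0f * v.z + 4.0f * v.w;
    }
    out[tid] = acc;
}

// Hypothetical host helper: runs one variant on float4s of all ones and
// returns the sum over all threads.
float runIncrementVariant(bool postInc)
{
    const int threads = 64, perThread = 8, count = threads * perThread;

    float4 hIn[count];
    for (int i = 0; i < count; ++i)
        hIn[i] = make_float4(1.0f, 1.0f, 1.0f, 1.0f);

    float4* dIn  = 0;
    float*  dOut = 0;
    cudaMalloc((void**)&dIn, count * sizeof(float4));
    cudaMalloc((void**)&dOut, threads * sizeof(float));
    cudaMemcpy(dIn, hIn, count * sizeof(float4), cudaMemcpyHostToDevice);

    if (postInc)
        sumPostInc<<<1, threads>>>(dIn, dOut, perThread);
    else
        sumPreInc<<<1, threads>>>(dIn, dOut, perThread);

    float hOut[threads];
    cudaMemcpy(hOut, dOut, threads * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int i = 0; i < threads; ++i)
        sum += hOut[i];

    cudaFree(dIn);
    cudaFree(dOut);
    return sum;
}
```

Both `runIncrementVariant(true)` and `runIncrementVariant(false)` should return the same value (512 float4s of ones at 10 each), which confirms the two forms are functionally equivalent before comparing what the compiler generated for them.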
Is there a WOPT flag to nvopencc to turn off the use of local memory (plus something to ptxas to stop it doing anything)? If not, are there any plans for one?