On page 87 of the CUDA Programming Guide, version 2.3.1,
there is the following explanation:
"Center: misaligned float memory access resulting in one transaction." This is for CUDA 1.2 or higher.
But in my case, the following code gives results suggesting that a misaligned float memory access results in 16 transactions:
__global__ void offsetcopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
For offset = 1, 2, …, 15, an 8x performance degradation arises.
This means 16 transactions are issued per half-warp.
What happens with this misaligned memory access?
Is the programming guide wrong?
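For reference, here is a minimal sketch of how the slowdown per offset could be timed with CUDA events. The block size, grid size, padding, and the helper `bandwidthGBs` are assumptions made for illustration and are not taken from the original test; the kernel is the one shown above, with `const` added to the input pointer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the post: each thread copies one float, shifted by `offset`.
__global__ void offsetcopy(float *odata, const float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}

// Hypothetical helper: effective bandwidth in GB/s for one read and
// one write of n floats that took `ms` milliseconds.
static double bandwidthGBs(size_t n, float ms)
{
    return 2.0 * n * sizeof(float) / (ms * 1.0e6);
}

int main()
{
    const int threads = 256;    // assumed block size
    const int blocks = 4096;    // assumed grid size
    const int maxOffset = 16;   // largest shift tested
    const size_t n = (size_t)threads * blocks;
    const size_t bytes = (n + maxOffset) * sizeof(float);

    // Allocate maxOffset extra elements so shifted accesses stay in bounds.
    float *idata = 0, *odata = 0;
    if (cudaMalloc((void **)&idata, bytes) != cudaSuccess ||
        cudaMalloc((void **)&odata, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(idata, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int offset = 0; offset <= maxOffset; ++offset) {
        // Warm-up launch so the timed launch excludes one-time startup cost.
        offsetcopy<<<blocks, threads>>>(odata, idata, offset);

        cudaEventRecord(start, 0);
        offsetcopy<<<blocks, threads>>>(odata, idata, offset);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %2d: %.3f ms, %.2f GB/s\n",
               offset, ms, bandwidthGBs(n, ms));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(idata);
    cudaFree(odata);
    return 0;
}
```

Comparing the line for offset 0 against the lines for offsets 1 to 15 gives the slowdown factor. A single timed launch per offset is noisy, so averaging several launches would give steadier numbers.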
But that isn’t what it says. It says “Examples of Global Memory Access by Devices with Compute Capability 1.2 and Higher” [emphasis mine]. My guess is your card is a compute 1.1 capability card (GF8xxx, GF9xxx, GTS1xx, GTS2xx or the Quadro/Tesla equivalent).
Right now only the GTX200 desktop series (and Quadro equivalents) and the Tesla C1060/S1070 series are compute 1.3 capable. There have been some compute 1.2 mobile parts announced, but they probably won’t be in anything you can buy until the end of the year.
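If you are not sure which compute capability your card has, you can ask the runtime. The following is a minimal sketch using `cudaGetDeviceProperties`; the helper name `isCompute12OrHigher` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true for compute capability 1.2 or higher,
// the class of devices the figure in the programming guide refers to.
static bool isCompute12OrHigher(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 2);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            printf("Device %d: could not read properties\n", dev);
            continue;
        }
        // prop.major and prop.minor hold the compute capability.
        printf("Device %d: %s, compute capability %d.%d (%s)\n",
               dev, prop.name, prop.major, prop.minor,
               isCompute12OrHigher(prop.major, prop.minor)
                   ? "1.2 or higher"
                   : "below 1.2");
    }
    return 0;
}
```

If this reports a capability below 1.2, the figure in the guide does not apply to your card, which would be consistent with the slowdown you measured.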