shared memory loading

Hi,
I ran some bandwidth tests on real data of mine, and these are the results:
Host->Device [36,211,968 bytes] : [14.394]ms
Host->Device [362,119 bytes] : [0.153]ms

Device->Host [36,211,968 bytes] : [18.412]ms
Device->Host [362,119 bytes] : [0.196]ms

Device->Device [36,211,968 bytes] : [0.010]ms
Device->Device [362,119 bytes] : [0.005]ms
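
To make the numbers above easier to compare, here is a small host-only sketch that converts each timing into an effective rate. The helper names `gbPerSec` and `printRates` are made up for this post and are not part of my kernel code; 1 GB is taken as 1e9 bytes.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes)
// for `bytes` transferred in `ms` milliseconds.
double gbPerSec(double bytes, double ms)
{
    assert(ms > 0.0);
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Prints the rate for every measurement listed above.
void printRates()
{
    const double bigBytes   = 36211968.0; // large transfer
    const double smallBytes = 362119.0;   // small transfer

    std::printf("Host->Device   : %8.2f GB/s (large) %8.2f GB/s (small)\n",
                gbPerSec(bigBytes, 14.394), gbPerSec(smallBytes, 0.153));
    std::printf("Device->Host   : %8.2f GB/s (large) %8.2f GB/s (small)\n",
                gbPerSec(bigBytes, 18.412), gbPerSec(smallBytes, 0.196));
    std::printf("Device->Device : %8.2f GB/s (large) %8.2f GB/s (small)\n",
                gbPerSec(bigBytes, 0.010), gbPerSec(smallBytes, 0.005));
}
```

That works out to roughly 2.5 GB/s host->device and 2.0 GB/s device->host for the large transfer. The large device->device figure comes out above 3600 GB/s, which looks too high to be a real copy rate, so that timing possibly covers only issuing the copy and not its completion.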

How does the rate of a copy from global memory (gmem) to shared memory (smem) compare with the host->device, device->host, and device->device rates?
The larger size above is approximately what I load into smem across all the blocks/threads I'm running, and the load took ~13 ms, which is about the same as the host->device copy. Am I doing something wrong here?
Is the read/write wrong?

Thanks in advance

The code is this:
__constant__ const int PRE_CALCULATED_CHUNK_SIZE = 256;
__constant__ const int MAX_THREADS_PER_DEVICE_BLOCK = 256;
__constant__ const int SHARED_SAMPLES_PER_BLOCK = 256;
__constant__ const int MAX_CALCULATED_PAIRS_PER_BLOCK = 256;
__constant__ const int CALCULAED_PAIR_SIZE = 4;

__shared__ float smSample1[ MAX_THREADS_PER_DEVICE_BLOCK ];
__shared__ float smSample2[ MAX_THREADS_PER_DEVICE_BLOCK ];

int iOutputIndex = 0;
int iBlockType = blockIdx.x % 3;
int iCurrentVel = blockIdx.x / 3;
int iTimeIndex = blockIdx.y * MAX_THREADS_PER_DEVICE_BLOCK + threadIdx.x;
if ( iTimeIndex >= nSamples )
	return;

if ( 0 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceSum[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceSum[ iInputPos + nSamples ];
}
if ( 1 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceSum2[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceSum2[ iInputPos + nSamples ];
}
if ( 2 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceStack[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceStack[ iInputPos + nSamples ];
}
__syncthreads();
iOutputIndex = ... ; //some calculation depending on some parameters.
if ( 0 == iBlockType )
{
	pOut1[ iOutputIndex + iTimeIndex ] += smSample1[ threadIdx.x ] * threadIdx.x;
	pOut2[ iOutputIndex + iTimeIndex ] += 3 * threadIdx.x - smSample2[ threadIdx.x ];
}
if ( 1 == iBlockType )
	pOut3[ iOutputIndex + iTimeIndex ] += smSample2[ threadIdx.x + delta ];
if ( 2 == iBlockType )
	pOut4[ iOutputIndex + iTimeIndex ] += smSample1[ threadIdx.x + delta ] + smSample2[ threadIdx.x + n + delta ];