shared memory loading

eyalhir74 · October 24, 2008, 9:12pm

Hi,
I ran some bandwidth tests on real data of mine, and these are the results:
Host->Device [36,211,968 bytes] : [14.394]ms
Host->Device [362,119 bytes] : [0.153]ms

Device->Host [36,211,968 bytes] : [18.412]ms
Device->Host [362,119 bytes] : [0.196]ms

Device->Device [36,211,968 bytes] : [0.010]ms
Device->Device [362,119 bytes] : [0.005]ms

How does the copy rate from gmem to smem compare with the host->device, device->host and device->device rates?
The size above is approximately what I load into smem across all the blocks/threads I'm running, and the load took ~13ms (about the same as Host->Device). Am I doing something wrong here?
Is the way I read/write wrong?
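For comparison, the timings above can be converted to effective rates by dividing bytes by time. This is only a rough conversion: it takes 1 GB = 10^9 bytes (an assumption) and treats the ~13ms smem load as moving the full 36,211,968 bytes, which the post gives only as an approximation.

```latex
\begin{align*}
\text{Host}\to\text{Device:}   \quad & \frac{36{,}211{,}968\ \text{B}}{14.394\ \text{ms}} \approx 2.52\ \text{GB/s},
  & \frac{362{,}119\ \text{B}}{0.153\ \text{ms}} &\approx 2.37\ \text{GB/s} \\
\text{Device}\to\text{Host:}   \quad & \frac{36{,}211{,}968\ \text{B}}{18.412\ \text{ms}} \approx 1.97\ \text{GB/s},
  & \frac{362{,}119\ \text{B}}{0.196\ \text{ms}} &\approx 1.85\ \text{GB/s} \\
\text{Device}\to\text{Device:} \quad & \frac{36{,}211{,}968\ \text{B}}{0.010\ \text{ms}} \approx 3621\ \text{GB/s},
  & \frac{362{,}119\ \text{B}}{0.005\ \text{ms}} &\approx 72.4\ \text{GB/s} \\
\text{gmem}\to\text{smem (kernel):} \quad & \frac{\sim 36{,}211{,}968\ \text{B}}{\sim 13\ \text{ms}} \approx 2.8\ \text{GB/s}
\end{align*}
```

On these numbers the smem load runs at roughly the Host->Device rate. The large Device->Device figure works out to several TB/s, well above the device memory bandwidth of GPUs from 2008, so that timing may not cover the whole copy (for example, if the timer stopped before the copy had finished).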

Thanks in advance

The code is this:
__constant__ const int PRE_CALCULATED_CHUNK_SIZE = 256;
__constant__ const int MAX_THREADS_PER_DEVICE_BLOCK = 256;
__constant__ const int SHARED_SAMPLES_PER_BLOCK = 256;
__constant__ const int MAX_CALCULATED_PAIRS_PER_BLOCK = 256;
__constant__ const int CALCULAED_PAIR_SIZE = 4;

__shared__ float smSample1[ MAX_THREADS_PER_DEVICE_BLOCK ];
__shared__ float smSample2[ MAX_THREADS_PER_DEVICE_BLOCK ];

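int iOutputIndex = 0;
// blockIdx.x selects the input array: 0 = pDeviceSum, 1 = pDeviceSum2, 2 = pDeviceStack
int iBlockType = blockIdx.x % 3;
int iCurrentVel = blockIdx.x / 3;
// global sample index handled by this thread
int iTimeIndex = blockIdx.y * MAX_THREADS_PER_DEVICE_BLOCK + threadIdx.x;
if ( iTimeIndex >= nSamples )
	return;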

if ( 0 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceSum[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceSum[ iInputPos + nSamples ];
}
if ( 1 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceSum2[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceSum2[ iInputPos + nSamples  ];
}
if ( 2 == iBlockType )
{
	smSample1[ threadIdx.x ] = pDeviceStack[ iInputPos ];
	smSample2[ threadIdx.x ] = pDeviceStack[ iInputPos + nSamples  ];
}
__syncthreads();
iOutputIndex = ... ; //some calculation depending on some parameters.
if ( 0 == iBlockType )
{
	pOut1[ iOutputIndex + iTimeIndex ] += smSample1[ threadIdx.x] * threadIdx.x;
	pOut2[ iOutputIndex + iTimeIndex ] += 3 * threadIdx.x - smSample2[ threadIdx.x ];
}
if ( 1 == iBlockType )
	pOut3[ iOutputIndex + iTimeIndex ] += smSample2[ threadIdx.x + delta ];
if ( 2 == iBlockType )
	pOut4[ iOutputIndex + iTimeIndex ] += smSample1[ threadIdx.x + delta ] + smSample2[ threadIdx.x + n + delta ];
