Slow shared memory reads, what am I doing wrong?

Hi All

I’m trying to create some code, that does the following.

  1. Allocate two buffers of shared memory, ‘channel’ and ‘result’. ‘Channel’ using 5120 Bytes, and ‘result’ using 8192 Bytes. The result buffer is 2048 float samples.

The program will then loop 256 times, each time doing the following:

  1. Load data into ‘channel’, after this, __sync (very low processing time, since I tried to remove this part)

  2. Process data equal to: Do 10 multiplications, one sqrt, one sincos, fetch 2 values from the channel buffer, add to the specific value at the result buffer (requiring 2 reads and 2 writes). Now, the threads will read all over the place, so some will be overlapping, but each result value is only written by a single thread. I use [32-by-8 threads], giving a total of 256.

Now… my code runs on a GTX260 with 27 SM’s and capability 1.3, using XP 32-bit. All the data is fairly manageable.

The code calculates 2^19 number of points, each point requiring 256 iterations of the process in no. 3. This would mean I use 2^28 writes of a float value, which is roughly 1 GB of data transfer. Again, reads can be clashing, but write adresses are unique to the thread. If I average 20 runs, I get a throughput of 7GB/s when using shared memory. When I look at the profiler, it states that I’m using >60% of my time on “gst 64b”.

This just doesn’t seem to add up? I would expect my card to write much faster to shared memory. Am I simply not getting this?

Cheers

Henrik Andresen

P.S. I have embedded the code for my kernel here:

[codebox] shared float channelBuffer[MAX_CHANNEL_BUFFER_FLOAT/sizeof(float)];

__shared__ float resultBuffer[MAX_RESULT_BUFFER_FLOAT/sizeof(float)];

int nCount, nTotalThreads, nThread, nChannel, nChannelOffset, nSampleOffset, nBufferOffset;

float sSquareElm, sSine, sDist, sDist2Sample, sDelta2Phase, sDeltaD, sElmPos, sPhaseDelta;

float sSampleReal, sSampleImag, sCosVal, sSinVal, sBeam2ChannelOffset;

int nSample, nResultOffset;

int nBeam;

int nSubBlockSize = dnBlockSamples / blockDim.x;

nThread = threadIdx.x + blockDim.x * threadIdx.y;

nTotalThreads = blockDim.x * blockDim.y;

nBeam = threadIdx.y + blockIdx.y * blockDim.y;

// Reset result buffer

for( nCount = nThread; nCount < 2 * dnBlockSamples * blockDim.y; nCount += nTotalThreads ){

	resultBuffer[nCount] = 0;

}

// Calculating offsets

sBeam2ChannelOffset =  (float)( blockIdx.x * dnBlockSamples * 2 ) * dsFs * dsDr / dsC;

nSampleOffset = 2 * rintf( sBeam2ChannelOffset + dsSampleOffset - (float)dnOverlap );

nBufferOffset = 2 * rintf( (float)dnOverlap - sBeam2ChannelOffset );

nResultOffset = 2 * ( threadIdx.x * nSubBlockSize + dnBlockSamples * threadIdx.y );

sDist2Sample = dsFs / dsC;

sDelta2Phase = -2.0f * MATH_PI * dsFc / dsC;

sSine = dpBeamAngle[nBeam];

// Channel iteration

for( nChannel = 0; nChannel < dnElements; nChannel++ ){

	// Load data		

	nChannelOffset = nSampleOffset + dnInputPitch * nChannel;

	for( nCount = nThread; nCount < 2 * dnBlockSize; nCount += nTotalThreads ){

		channelBuffer[nCount] = dpChannel[nChannelOffset + nCount];

	}

	// Make sure all data is transferred before processing begins

	__syncthreads();

	// Constants for each line	

	sElmPos = ((float)nChannel - ( (float)dnElements - 1.0f )/2.0f ) * dsPitch;

	sSquareElm = sElmPos * sElmPos;

	sDist = float( threadIdx.x * nSubBlockSize + blockIdx.x * dnBlockSamples ) * dsDr;

	// Iterate over all values				

	for( nCount = 0; nCount < nSubBlockSize; nCount++, sDist += dsDr ){

		sDeltaD = __fsqrt_rn( sSquareElm + sDist*sDist - 2 * sElmPos * sSine * sDist );

		nSample = rintf(( sDeltaD + sDist ) * sDist2Sample ) * 2 + nBufferOffset;

		sPhaseDelta = ( sDeltaD - sDist ) * sDelta2Phase;

		sSampleReal = channelBuffer[nSample] * dsBfGain;

		sSampleImag = channelBuffer[nSample+1] * dsBfGain;

		

		__sincosf( sPhaseDelta, &sSinVal, &sCosVal );

		resultBuffer[2*nCount + nResultOffset] += ( sSampleReal * sCosVal + sSampleImag * sSinVal );

		resultBuffer[2*nCount + nResultOffset + 1] += ( sSampleReal * sSinVal - sSampleImag * sCosVal );			

	}

}

	// Make sure all data is processed before transfer to output buffer begins

__syncthreads();

nResultOffset = 2 * ( dnSamples * blockDim.y * blockIdx.y + dnBlockSamples * blockIdx.x );

// Output values of buffer. Both reads and writes are coalesced within the thread

for( nBeam = 0; nBeam < blockDim.y; nBeam++, nResultOffset += 2 * dnSamples ){

	for( nCount = nThread; nCount < 2 * dnBlockSamples; nCount += nTotalThreads ){

		dpOutput[nResultOffset + nCount] = resultBuffer[nCount + 2 * nBeam * dnBlockSamples];

	}

}[/codebox]