Hi,
I have the following code:
#define BLOCK_THREAD_SIZE 256

//... load data to shared memory arrays.
__shared__ int   smnumMoveSamples[ BLOCK_THREAD_SIZE ];
__shared__ int   smNb[ BLOCK_THREAD_SIZE ];
__shared__ float smw2[ BLOCK_THREAD_SIZE ];
__shared__ int   smInputTracePos[ BLOCK_THREAD_SIZE ];

for( int iTraceIndex = 0; iTraceIndex < BLOCK_THREAD_SIZE; iTraceIndex++ )
{
    fTraceOutPhase  += smnumMoveSamples[ iTraceIndex ] + smInputTracePos[ iTraceIndex ] * threadIdx.x;
    fTraceOutPhase2 += smw2[ iTraceIndex ] * blockIdx.x;
    fTraceOutStack  += smNb[ iTraceIndex ] - threadIdx.x;
    fFold           += blockIdx.x;
}
This is test code I tried (the real code is a bit different, with texture access and such…), yet this loop alone accounts for ~70-80% of the kernel time - i.e. I spend most of the kernel time on this!!! I always thought the texture/gmem access was the expensive part…
Every thread runs this for loop, since each thread needs to go over all of the data.
I guess I'm doing the worst possible shared memory access here - but can a penalty this big really be caused by shared memory bank conflicts? Is there any way of making it better?
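One thing I noticed while posting this: threadIdx.x and blockIdx.x don't change inside the loop, so maybe I could just accumulate the shared-memory sums once and apply the per-thread/per-block factors afterwards? Something like the rough sketch below (same variable names as my snippet above - I haven't verified the numerics are identical to the original loop):

// Rough sketch: accumulate the shared-memory sums once,
// then apply the per-thread / per-block factors outside the loop.
int   sumMove = 0;
int   sumPos  = 0;
int   sumNb   = 0;
float sumW2   = 0.0f;
for( int iTraceIndex = 0; iTraceIndex < BLOCK_THREAD_SIZE; iTraceIndex++ )
{
    sumMove += smnumMoveSamples[ iTraceIndex ];
    sumPos  += smInputTracePos[ iTraceIndex ];
    sumNb   += smNb[ iTraceIndex ];
    sumW2   += smw2[ iTraceIndex ];
}
fTraceOutPhase  += sumMove + sumPos * threadIdx.x;
fTraceOutPhase2 += sumW2 * blockIdx.x;
fTraceOutStack  += sumNb - BLOCK_THREAD_SIZE * threadIdx.x;
fFold           += BLOCK_THREAD_SIZE * blockIdx.x;

Would that be the right direction, or is the loop itself not actually the problem?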
thanks
eyal.