Problem with dynamically allocated shared memory

I’m trying to write some test code and ran into strange behavior with a shared memory array: the data gets corrupted. I guess the problem is in the calculation of the shared memory size.

The code just copies an array of chars to shared memory and then back out to global memory:

const int WORDLEN   = 3;
const int GRIDSIZE  = 32;
const int BLOCKSIZE = 256;

texture<char, 1> texWordsGPU;

__global__ void testKernel(int g_wordlen, long g_len, char* g_wordsProcessedGPU)
{
    extern __shared__ char sdata[];
    int c;

    for (int i = blockIdx.x * blockDim.x; i < g_len; i += blockDim.x * gridDim.x)
    {
        // stage each word in shared memory...
        for (c = 0; c < g_wordlen; c++)
            sdata[c + 4 * threadIdx.x] = tex1Dfetch(texWordsGPU, c + i + threadIdx.x);

        // ...then copy it out to global memory
        for (c = 0; c < g_wordlen; c++)
            g_wordsProcessedGPU[c + i + threadIdx.x] = sdata[c + 4 * threadIdx.x];
    }
}

void runTest(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    long len = WORDLEN * wordcount(WORDLEN);
    char* words = new char[len];
    len = gen_word_list(WORDLEN, words, len);   // generate test data
    output("In.txt", WORDLEN, words, len);      // save to file

    char* wordsGPU;
    char* wordsProcessedGPU;
    cudaMalloc((void**)&wordsGPU, len * sizeof(char));
    cudaMalloc((void**)&wordsProcessedGPU, len * sizeof(char));
    cudaMemcpy(wordsGPU, words, len * sizeof(char), cudaMemcpyHostToDevice);
    cudaBindTexture(0, texWordsGPU, wordsGPU, len * sizeof(char));

    dim3 grid(GRIDSIZE, 1, 1);
    dim3 threads(BLOCKSIZE, 1, 1);
    testKernel<<< grid, threads, BLOCKSIZE * WORDLEN * sizeof(char) >>>(WORDLEN, len, wordsProcessedGPU);

    cudaMemcpy(words, wordsProcessedGPU, len * sizeof(char), cudaMemcpyDeviceToHost);
    output("Out.txt", WORDLEN, words, len);     // save to file

    delete [] words;
    cudaFree(wordsGPU);
    cudaFree(wordsProcessedGPU);
}

After a lot of experiments I found a workaround. Changing the shared memory size calculation from:

BLOCKSIZE * WORDLEN * sizeof(char)

to:

BLOCKSIZE * (WORDLEN+1) * sizeof(char)

gives the correct result, and the minimum size that still works is

BLOCKSIZE * (WORDLEN+1) * sizeof(char) - 31

But it is still unclear to me where the problem is!

OK, I think I see one slight error:

When you define your shared memory size, you make it (decoding the constants): 256 * 3 * 1 = 768 bytes.

When you access the memory, you use the following indexing expression:
sdata[c + 4*threadIdx.x]

The max value of c is WORDLEN-1 = 2, and the max value of threadIdx.x is BLOCKSIZE-1 = 255. So the biggest index into sdata you ask for is 2 + 4 * 255 = 1022, which is well beyond the allocated size.

You probably want your shared memory allocation to follow your access pattern: ((BLOCKSIZE-1) * 4 + WORDLEN) * sizeof(char), since you’re explicitly typing ‘4’ in your access to sdata.
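For example (a sketch based on your existing launch, not something I’ve compiled), the call would become:

// size shared memory to the real access pattern: the largest index touched
// is (WORDLEN-1) + 4*(BLOCKSIZE-1) = 1022, so allocate 1023 bytes
testKernel<<< grid, threads, ((BLOCKSIZE - 1) * 4 + WORDLEN) * sizeof(char) >>>(WORDLEN, len, wordsProcessedGPU);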

I’m actually mildly confused about the way you access sdata. With WORDLEN = 3 and a stride of 4, every fourth entry in sdata is never touched. The only reason I can see for that is to avoid bank conflicts (each word lives in its own 32-bit dword, so threads don’t cross-talk when accessing them); but that’s going to burn you if WORDLEN is ever greater than 4.
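If you want that padding to survive larger words, one option (just a sketch, untested) is to derive the stride from WORDLEN instead of hard-coding 4, rounding up to the next multiple of 4 so each word still starts on its own 32-bit boundary:

// hypothetical generalization: stride = WORDLEN rounded up to a multiple of 4,
// so a word never shares a dword with its neighbour (4 for WORDLEN <= 4, 8 for 5..8, ...)
const int STRIDE = ((WORDLEN + 3) / 4) * 4;

// sdata accesses become sdata[c + STRIDE * threadIdx.x], and the launch:
testKernel<<< grid, threads, ((BLOCKSIZE - 1) * STRIDE + WORDLEN) * sizeof(char) >>>(WORDLEN, len, wordsProcessedGPU);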

Anyway, hope that helps!

You are right!
I’ve checked your suggestions and now the reason is clear.
Thanks for the help!

I went on to a new experiment and found something even stranger. I want the words reversed:

__global__ void testKernel(int g_wordlen, long g_len, char* g_wordsProcessedGPU)
{
    int c;

    for (int i = blockIdx.x * blockDim.x; i < g_len; i += blockDim.x * gridDim.x)
    {
        // write each word back with its characters in reverse order
        for (c = 0; c < g_wordlen; c++)
            g_wordsProcessedGPU[c + i + threadIdx.x] = tex1Dfetch(texWordsGPU, (g_wordlen - c - 1) + i + threadIdx.x);
    }
}

But the data gets corrupted. The code is very simple and the equivalent works perfectly on the CPU, so why is the GPU result different?
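Here is a hand trace of the write pattern for the first few threads (g_wordlen = 3, i = 0), in case it helps spot what I’m missing:

// thread 0: out[0] = tex[2], out[1] = tex[1], out[2] = tex[0]
// thread 1: out[1] = tex[3], out[2] = tex[2], out[3] = tex[1]
// thread 2: out[2] = tex[4], out[3] = tex[3], out[4] = tex[2]
// note that out[1], out[2], out[3], ... are each written by more than one
// thread, with different values in each case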