Incorrect result when reversing an array. Compiler error?

I’m trying to write code that reverses the elements of each data chunk in a linear array:

const int WORDLEN = 3;
const int GRIDSIZE = 32;
const int BLOCKSIZE = 256;

texture<char, 1> texWordsGPU;

__global__ void testKernel(int g_wordlen, long g_len, char* g_wordsProcessedGPU)
{
    for (int i = blockIdx.x * blockDim.x; i < g_len; i += blockDim.x * gridDim.x)
    {
        for (int c = 0; c < g_wordlen; c++)
            g_wordsProcessedGPU[c + i + threadIdx.x] = tex1Dfetch(texWordsGPU, (g_wordlen - c - 1) + i + threadIdx.x);
    }
}


void runTest(int argc, char** argv)
{
    long len = WORDLEN * wordcount;
    char* words = new char[len];
    len = gen_word_list(WORDLEN, words, len);  // generate test data

    char* wordsGPU;
    char* wordsProcessedGPU;
    cudaMalloc((void**)&wordsGPU, len * sizeof(char));
    cudaMalloc((void**)&wordsProcessedGPU, len * sizeof(char));
    cudaMemcpy(wordsGPU, words, len * sizeof(char), cudaMemcpyHostToDevice);
    cudaBindTexture(0, texWordsGPU, wordsGPU, len * sizeof(char));

    dim3 grid(GRIDSIZE, 1, 1);
    dim3 threads(BLOCKSIZE, 1, 1);
    testKernel<<< grid, threads >>>(WORDLEN, len, wordsProcessedGPU);

    cudaMemcpy(words, wordsProcessedGPU, len * sizeof(char), cudaMemcpyDeviceToHost);
}

For example, the expected result for "0123456789ab… " is "210543876ba9… ", but the GPU calculation produces something like "210453786ab9… ".

So, where is the error?

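Huh? For c == 1 and threadIdx.x == 0 you write to the same place as for c == 0 and threadIdx.x == 1, but with a different value. Every thread starts at offset i + threadIdx.x, so neighbouring threads start only one byte apart while each of them writes g_wordlen bytes. The start offset of each thread has to be a multiple of g_wordlen instead.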
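To make the overlap concrete, here is a minimal sketch of one way to avoid it: give every thread a whole word, so that thread w only touches bytes w*wordlen to (w+1)*wordlen - 1. Note the assumptions: the names reverseWordsKernel and reverseWords are made up for this example, the input is read through a plain global pointer instead of the texture from the question, and error checking is left out for brevity. Any trailing bytes that do not form a full word are copied through unchanged.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <string>

// One thread per word (hypothetical kernel name): thread w owns the bytes
// [w*wordlen, (w+1)*wordlen), so no two threads write to the same location.
__global__ void reverseWordsKernel(int wordlen, long len, const char* in, char* out)
{
    long wordCount = len / wordlen;
    long stride = (long)blockDim.x * gridDim.x;   // grid-stride loop over words
    for (long w = (long)blockIdx.x * blockDim.x + threadIdx.x; w < wordCount; w += stride)
    {
        long base = w * wordlen;                  // start offset is a multiple of wordlen
        for (int c = 0; c < wordlen; ++c)
            out[base + c] = in[base + (wordlen - 1 - c)];
    }
}

// Hypothetical host helper, only for trying the kernel out; no error checking.
std::string reverseWords(const std::string& text, int wordlen)
{
    assert(wordlen > 0);
    long len = (long)text.size();
    if (len == 0) return text;

    char* dIn = nullptr;
    char* dOut = nullptr;
    cudaMalloc((void**)&dIn, len);
    cudaMalloc((void**)&dOut, len);
    cudaMemcpy(dIn, text.data(), len, cudaMemcpyHostToDevice);
    // Pre-fill the output so a trailing partial word stays as it was.
    cudaMemcpy(dOut, dIn, len, cudaMemcpyDeviceToDevice);

    reverseWordsKernel<<<32, 256>>>(wordlen, len, dIn, dOut);

    std::string result(len, '\0');
    cudaMemcpy(&result[0], dOut, len, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this, reverseWords("0123456789ab", 3) should give "210543876ba9", which is the expected result from the question.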

Though unless you change your code to use shared memory, I think it will be impossible to be any faster than the CPU. If done right, it will probably also make your code simpler.
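A rough sketch of what a shared memory version could look like, under a few assumptions: the names reverseWordsShared and reverseWordsSharedHost are invented here, each block stages blockDim.x words per tile in dynamically sized shared memory, and the tile has to fit into the default 48 KB limit. Consecutive threads then read and write consecutive bytes of global memory, and the reversal itself happens on the shared copy. Whether this actually beats the CPU would have to be measured, since the host-device transfers are part of the cost.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <string>

// Hypothetical kernel: stage a tile of whole words in shared memory,
// then write each output byte from the mirrored position inside its word.
__global__ void reverseWordsShared(int wordlen, long len, const char* in, char* out)
{
    extern __shared__ char tile[];                 // blockDim.x * wordlen bytes
    long fullLen   = (len / wordlen) * wordlen;    // ignore a trailing partial word
    long tileBytes = (long)blockDim.x * wordlen;

    // The loop bounds depend only on blockIdx, so all threads of a block
    // take the same number of iterations and __syncthreads() is safe.
    for (long start = (long)blockIdx.x * tileBytes; start < fullLen;
         start += (long)gridDim.x * tileBytes)
    {
        long n = (fullLen - start < tileBytes) ? (fullLen - start) : tileBytes;

        for (long j = threadIdx.x; j < n; j += blockDim.x)
            tile[j] = in[start + j];               // consecutive threads, consecutive bytes
        __syncthreads();

        for (long j = threadIdx.x; j < n; j += blockDim.x)
        {
            long wordBase = (j / wordlen) * wordlen;
            long c = j - wordBase;                 // position inside the word
            out[start + j] = tile[wordBase + (wordlen - 1 - c)];
        }
        __syncthreads();                           // tile is reused in the next iteration
    }
}

// Hypothetical host helper, only for trying the kernel out; no error checking.
std::string reverseWordsSharedHost(const std::string& text, int wordlen)
{
    const int threads = 256;
    size_t sharedBytes = (size_t)threads * wordlen;
    assert(wordlen > 0 && sharedBytes <= 48 * 1024);
    long len = (long)text.size();
    if (len == 0) return text;

    char* dIn = nullptr;
    char* dOut = nullptr;
    cudaMalloc((void**)&dIn, len);
    cudaMalloc((void**)&dOut, len);
    cudaMemcpy(dIn, text.data(), len, cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, dIn, len, cudaMemcpyDeviceToDevice);  // keeps a trailing partial word

    reverseWordsShared<<<32, threads, sharedBytes>>>(wordlen, len, dIn, dOut);

    std::string result(len, '\0');
    cudaMemcpy(&result[0], dOut, len, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because every tile starts at a multiple of wordlen and holds only whole words, no word is ever split across two tiles or two blocks.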