CUDA program outputs random results when using large arrays

__global__ void get_string( char *L, char *buff, int *buf_index, int b_size )
{
    // "L", "buff" and "buf_index" all have a size of "b_size"
    int tid = threadIdx.x + (blockIdx.x * blockDim.x);

    if ( tid < b_size )                          //line 01
    {                                            //line 02
        if ( buf_index[tid] > 0 )                //line 03
            L[tid] = buff[ buf_index[tid]-1 ];   //line 04
        else                                     //line 05
            L[tid] = buff[ b_size-1 ];           //line 06
    }                                            //line 07

    // if ( tid < b_size )                       //line 08
    //     L[tid] = buff[tid];                   //line 09
}


int main()
{
    // code to set up device pointers

    // code to transfer "dev_L", "dev_buffer", "dev_bufferindex" and "buffer_size" to the device using cudaMemcpyHostToDevice

    // 8 blocks x 128 threads = 1024 threads, one per element of "L", "buff" and "buf_index" (each of size 1024)
    get_string<<< 8, 128 >>>( dev_L, dev_buffer, dev_bufferindex, buffer_size );

    // code to get "L" from the device using cudaMemcpyDeviceToHost
}
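For reference, here is a minimal, self-contained sketch of what the elided host code could look like with error checking on every CUDA call. The `CUDA_CHECK` macro, the `h_` host arrays and the fill pattern are assumptions made for illustration, not the poster's actual code. Two details may be worth comparing against the real program: the `int` array is allocated and copied with `buffer_size * sizeof(int)` bytes, and every element of both input arrays is initialised before the copy. The sketch calls `cudaDeviceSynchronize`, which exists from CUDA 4.0 on; with Toolkit 3.2 the equivalent call is `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Same logic as the kernel in the question.
__global__ void get_string(char *L, const char *buff, const int *buf_index, int b_size)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < b_size) {
        int idx = buf_index[tid];
        L[tid] = (idx > 0) ? buff[idx - 1] : buff[b_size - 1];
    }
}

int main()
{
    const int buffer_size = 1024;
    const int threads = 128;
    const int blocks = (buffer_size + threads - 1) / threads;  // 8 blocks for 1024 elements

    static char h_L[buffer_size];
    static char h_buffer[buffer_size];
    static int  h_bufferindex[buffer_size];

    // Illustrative input: every element is initialised and every index lies in [0, buffer_size].
    for (int i = 0; i < buffer_size; ++i) {
        h_buffer[i] = 'a' + i % 26;
        h_bufferindex[i] = (i * 7) % (buffer_size + 1);
    }

    char *dev_L = NULL;
    char *dev_buffer = NULL;
    int  *dev_bufferindex = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dev_L, buffer_size * sizeof(char)));
    CUDA_CHECK(cudaMalloc((void **)&dev_buffer, buffer_size * sizeof(char)));
    CUDA_CHECK(cudaMalloc((void **)&dev_bufferindex, buffer_size * sizeof(int)));

    // Byte counts differ per array: sizeof(char) for the buffers, sizeof(int) for the indices.
    CUDA_CHECK(cudaMemcpy(dev_buffer, h_buffer, buffer_size * sizeof(char),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dev_bufferindex, h_bufferindex, buffer_size * sizeof(int),
                          cudaMemcpyHostToDevice));

    get_string<<<blocks, threads>>>(dev_L, dev_buffer, dev_bufferindex, buffer_size);
    CUDA_CHECK(cudaGetLastError());        // reports launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h_L, dev_L, buffer_size * sizeof(char),
                          cudaMemcpyDeviceToHost));

    printf("First 16 results: %.16s\n", h_L);

    CUDA_CHECK(cudaFree(dev_L));
    CUDA_CHECK(cudaFree(dev_buffer));
    CUDA_CHECK(cudaFree(dev_bufferindex));
    return 0;
}
```

If any call fails, the macro prints the file, line and CUDA error string, which usually narrows down whether the problem is in the setup, the launch, or the kernel itself.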


If “L”, “buff” and “buf_index” are small (fewer than 10 elements), the lines labeled “line 01” through “line 07” work, and the values stored in the array “L” are correct.

But if “L”, “buff” and “buf_index” are large (for example 1024 elements), the lines labeled “line 01” through “line 07” do NOT work. The program still runs, but every time I run the compiled program using the same values stored in “L”, “buff” and “buf_index”, I get different values stored in the array “L”. If I remove “line 01” through “line 07”, uncomment “line 08” and “line 09”, and recompile the program, the values stored in “L” are the same on every run and correct for that version, even though they are not the values I actually want in “L”.

What I want is for my program to work using “line 01” through “line 07” with a size of 1024 or greater.

So could anyone help me out with this?

I am running:

Windows XP SP2 [32-bit]

GeForce GTX 280

CUDA Toolkit 3.2

NVIDIA driver 260.99

Visual Studio 2008

What does your initialization code look like? You could also try debugging your kernel, and use cuPrintf to check what is going on.
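One way to check what is going on without a debugger is to validate the input on the host and compare the GPU result against a CPU reference. The sketch below is host-only C++ that also compiles under nvcc; the function names `first_bad_index` and `get_string_reference` are hypothetical, and it assumes `buff` and `buf_index` have the same length. An index greater than `b_size` would make the kernel read past the end of `buff`, so that is the case the first function looks for.

```cpp
#include <cstddef>
#include <vector>

// Returns the position of the first entry that would make the kernel read
// outside buff (an index greater than b_size), or -1 if every entry is safe.
// Entries <= 0 are safe because the kernel then reads buff[b_size - 1].
int first_bad_index(const std::vector<int>& buf_index, int b_size)
{
    for (std::size_t i = 0; i < buf_index.size(); ++i) {
        if (buf_index[i] > b_size) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// CPU version of the kernel logic, one loop iteration per GPU thread.
// Assumes buf_index.size() == buff.size() and that first_bad_index returned -1.
std::vector<char> get_string_reference(const std::vector<char>& buff,
                                       const std::vector<int>& buf_index)
{
    const int b_size = static_cast<int>(buff.size());
    std::vector<char> L(buff.size());
    for (int tid = 0; tid < b_size; ++tid) {
        const int idx = buf_index[tid];
        L[tid] = (idx > 0) ? buff[idx - 1] : buff[b_size - 1];
    }
    return L;
}
```

If `first_bad_index` returns -1 and the array copied back from the device still differs from `get_string_reference`, the input data itself is probably fine and the allocation sizes and copy sizes are the next thing to check.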

using the same values stored in “L”, “buff” and “buf_index”, I get different values stored in the array “L”

What do you mean by that? Do you expect L to be the same before and after running the kernel?
If that is the case, then your code is incorrect.