Loops cause "too much local data" error: trouble processing a large array in global memory from a kernel.

I have a large array that I’m copying into global memory so that the 500 blocks of my kernel (one thread each) can modify it.

The array has to hold 500000 elements. I’ve declared it on the host as:

#define VERTMAX 500000

#define NSMAX 15000

//beginning of the code


//some code

cudaMemcpy(nrays, hnrays, VERTMAX*sizeof(int),cudaMemcpyHostToDevice );//array

//d_tpick is float array of 500 items

//d_tdist is int array of 500 items

//d_x is float array of 500 items

//d_y is float array of 500 items

//d_z is float array of 500 items

//d_sums is float array of 4 items

//ngshot is an int

//t is a float array of 5000000 items, values get read inside the kernel with no problems

traceshot<<<500,1>>>(  d_tpick, d_dist, d_x, d_y, d_z,  d_sums, ngshot, t , nrays);

Inside the traceshot kernel:

int    i,j,k,ish,jsh,ksh,iii,iiiii,



int addy;

int iseg [NSMAX];

	//Double variables

       double x,y,z,xi,yj,zk,xs,ys,zs,



	//Double Arrays

   	double gradt[4],dd[4],tstat[6];

while(j<=nseg){//0 to 15000 or less

    //some code

  while(iii<=nseg){//0 to 15000 or less

   //some code

   //index gets a valid integer value

nrays[index] +=1;//error in this line



ptxas error : Entry function '_Z9traceshotPdS_S_S_S_S_iPfPi' uses too much local data (0xea60 bytes, 0x4000 max)

make: *** [somename.o] Error 255

If I remove the line that triggers the error, the program compiles successfully. Why does CUDA say I’m using too much local data if the large array is stored in global memory?

(Thanks in advance, I’ve been crawling the forum for weeks now and I haven’t found a logical explanation for this problem so I had to post)

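There is a 16 KB per-thread local memory limit (the 0x4000 max in the error message). Your kernel uses far more than that: iseg alone requires about 60 KB (15000 ints × 4 bytes = 60000 bytes, which is the 0xea60 that ptxas reports). The nrays array in global memory is not the problem. Removing the nrays[index] line most likely lets the compiler discard the code that uses iseg as dead code, which is why the error disappears when you take it out.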
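One way around the limit is to move iseg out of per-thread local memory and into a scratch buffer in global memory, with one NSMAX-sized slice per block. The following is only a minimal sketch under assumptions: `traceshot_sketch`, `iseg_scratch` and `run_traceshot_sketch` are hypothetical names, the index computation is a placeholder for the real ray-tracing code (which the post does not show), and `atomicAdd` is used on the assumption that several shots can increment the same nrays element (integer atomics on global memory need compute capability 1.1 or later).

```cuda
#include <vector>
#include <cuda_runtime.h>

#define VERTMAX 500000
#define NSMAX   15000

// The numbers from the ptxas message: iseg alone is 0xea60 bytes,
// well above the 0x4000-byte limit it reports.
static_assert(NSMAX * sizeof(int) == 0xea60, "iseg is 60000 bytes");
static_assert(NSMAX * sizeof(int) > 0x4000, "iseg exceeds the reported limit");

// Hypothetical stand-in for traceshot. It only demonstrates where iseg lives.
__global__ void traceshot_sketch(int *iseg_scratch, int *nrays, int nseg)
{
    // One thread per block, as in the original launch. Each block takes its
    // own NSMAX-sized slice of global memory instead of a local array.
    int *iseg = iseg_scratch + (size_t)blockIdx.x * NSMAX;

    for (int j = 0; j < nseg && j < NSMAX; ++j) {
        iseg[j] = (blockIdx.x + j) % VERTMAX;   // placeholder index computation
    }
    for (int j = 0; j < nseg && j < NSMAX; ++j) {
        // Assumption: different blocks may hit the same element, so the
        // increment is done atomically rather than with a plain +=.
        atomicAdd(&nrays[iseg[j]], 1);
    }
}

// Hypothetical host helper: allocates the scratch buffer, launches the
// sketch, and returns the total number of increments recorded in nrays.
long long run_traceshot_sketch(int nshots, int nseg)
{
    int *d_iseg = nullptr;
    int *d_nrays = nullptr;
    cudaMalloc(&d_iseg, (size_t)nshots * NSMAX * sizeof(int));
    cudaMalloc(&d_nrays, VERTMAX * sizeof(int));
    cudaMemset(d_nrays, 0, VERTMAX * sizeof(int));

    traceshot_sketch<<<nshots, 1>>>(d_iseg, d_nrays, nseg);
    cudaDeviceSynchronize();

    std::vector<int> h_nrays(VERTMAX);
    cudaMemcpy(h_nrays.data(), d_nrays, VERTMAX * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_iseg);
    cudaFree(d_nrays);

    long long total = 0;
    for (int v : h_nrays) total += v;
    return total;
}
```

Calling `run_traceshot_sketch(500, 100)` from `main` launches 500 blocks that each record 100 increments. For 500 shots the scratch buffer is 500 × 60000 bytes, roughly 30 MB of global memory, so whether this is acceptable depends on the card. If nseg is usually much smaller than NSMAX, a smaller per-block slice would cut that down.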