I have a large array that I'm copying into global memory, hoping to make it available to the 500 kernel threads that modify it.
The array has to be 500,000 elements long. I've allocated it from the host as:
#define VERTMAX 500000
#define NSMAX 15000
// beginning of the code
cudaMalloc((void **)&nrays, VERTMAX * sizeof(int));
// some code
cudaMemcpy(nrays, hnrays, VERTMAX * sizeof(int), cudaMemcpyHostToDevice); // the array in question
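For completeness, here is a minimal, self-contained sketch of the same allocation and copy with the return codes checked (the error-handling style is mine, not from the original program; hnrays stands in for the real host data):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define VERTMAX 500000

int main(void) {
    int *nrays = NULL;
    int *hnrays = (int *)calloc(VERTMAX, sizeof(int)); // placeholder host data

    // Allocate the device array and verify the call succeeded.
    cudaError_t err = cudaMalloc((void **)&nrays, VERTMAX * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy host -> device and verify again.
    err = cudaMemcpy(nrays, hnrays, VERTMAX * sizeof(int), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(nrays);
    free(hnrays);
    return 0;
}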
// d_tpick is a double array of 500 items
// d_dist is a double array of 500 items
// d_x is a double array of 500 items
// d_y is a double array of 500 items
// d_z is a double array of 500 items
// d_sums is a double array of 4 items
// ngshot is an int
// t is a float array of 5000000 items; its values are read inside the kernel with no problems
traceshot<<<500, 1>>>(d_tpick, d_dist, d_x, d_y, d_z, d_sums, ngshot, t, nrays);
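For reference, the entry name in the ptxas error below (_Z9traceshotPdS_S_S_S_S_iPfPi) demangles to six double* parameters, an int, a float*, and an int*. Using the argument names from the launch above (the names are my inference, not the original source), the kernel signature would be:

// Demangled from _Z9traceshotPdS_S_S_S_S_iPfPi; parameter names
// are guesses taken from the launch site, not the original code.
__global__ void traceshot(double *d_tpick, double *d_dist,
                          double *d_x, double *d_y, double *d_z,
                          double *d_sums, int ngshot,
                          float *t, int *nrays);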
Inside the traceshot kernel:
int i,j,k,ish,jsh,ksh,iii,iiiii,
nshot,ngshot,nseg,
md,iscell,jscell,kscell,ist,nstat;
int addy;
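// NSMAX = 15000, so iseg alone takes 15000 * sizeof(int) = 60000 bytes (0xEA60) of per-thread local storage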
int iseg[NSMAX];
//Double variables
double x,y,z,xi,yj,zk,xs,ys,zs,
d,length,dist,tpick,dt,dtsum,
dt2sum,dusum,du2sum,fx,fy,fz;
//Double Arrays
double gradt[4],dd[4],tstat[6];
while (j <= nseg) { // 0 to 15000 or less
    // some code
    while (iii <= nseg) { // 0 to 15000 or less
        // some code
        // index gets a valid integer value
        nrays[index] += 1; // error in this line
    }
}
ptxas error : Entry function '_Z9traceshotPdS_S_S_S_S_iPfPi' uses too much local data (0xea60 bytes, 0x4000 max)
make: *** [somename.o] Error 255
If I remove the offending line, the program compiles successfully. Why does ptxas say I'm using too much local data when the large array is stored in global memory?
(Thanks in advance. I've been crawling the forum for weeks now and haven't found a logical explanation for this problem, so I had to post.)