I am using a Tesla K80 GPU with 48 KB of shared memory per SM. I have launched a kernel like this:
__global__ void kernel( some inputs )
{
// i is the equation number
int i = threadIdx.x;
// allocating memory in the shared memory for all 6 arrays
/* case 1: six statically allocated shared arrays
__shared__ double a[n]; __shared__ double b[n];
__shared__ double c[n]; __shared__ double d[n];
__shared__ double e[n]; __shared__ double x[n];
*/
extern __shared__ double array[]; // case 2: one dynamic allocation, partitioned manually
double *a = array;  double *b = &a[n];
double *c = &b[n];  double *d = &c[n];
double *e = &d[n];  double *x = &e[n];
// rest of kernel unnecessary for this question
}
int main(int argc, char ** argv)
{
// launching the kernel
// kernel<<<1, n>>>(some inputs); // case 1: static shared memory
kernel<<<1, n, 6*n*sizeof(double)>>>(some inputs); // case 2: dynamic shared memory
checkCuda( cudaPeekAtLastError() ); checkCuda(cudaDeviceSynchronize());
}
For case 1, I can launch the kernel with only around 750 threads, but for case 2 the limit increases to 896. Six double arrays of 896 elements each need 6 * 896 * 8 bytes = 43008 bytes = 43.008 KB. I know that I cannot use all 48 KB, but 43 KB still seems a little low to me. Is there a justification for this? Also, what is the difference between case 1 and case 2 that causes the different thread limits?
Thanks in advance.