How to allocate shared memory for an array

I am trying to use dynamically allocated shared memory for the variable “etan”, like this:

__global__ void PT1(double* dvxdx, double* dvydy, double* dvxdy, double* dvydx, double* vx, double* vy, double* alpha, double* beta, int* index, double* kvx, double* kvy, double* etan, double* Helem, double* areas, bool* isice, double* Eta_nbe, double* rheology_B, double n_glen, double eta_0, double rele, int nbe){
    extern __shared__ double ss[];
    int ii = threadIdx.x;
    int ix = threadIdx.x+blockIdx.x * blockDim.x;
    ss[ii] = etan[nbe-ii-1];
    __syncthreads();

    etan[ii] = ss[nbe-ii-1];
    for(int ix = threadIdx.x+blockIdx.x * blockDim.x; ix<nbe; ix += blockDim.x * gridDim.x) {
        if (isice[ix]) ss[ix] = min(exp(rele*log(eta_it) + (1-rele)*log(ss[ix])),eta_0*1e5);

        Eta_nbe[ix] = ss[ix]*areas[ix];
    }
}

int main()
{
    double *d_etan;
    cudaMalloc((void**)&d_etan, nbe*sizeof(double));
    cudaMemcpy(d_etan, etan, nbe*sizeof(double), cudaMemcpyHostToDevice);

    PT1<<<gride, blocke, nbe*sizeof(double)>>>(dvxdx, dvydy, dvxdy, dvydx, d_vx, d_vy, d_alpha, d_beta, d_index, kvx, kvy, d_etan, d_Helem, d_areas, d_isice, Eta_nbe, d_rheology_B, n_glen, eta_0, rele, nbe);
    cudaDeviceSynchronize();
}

However, the launch fails with “ERROR launching GPU C-CUDA program: invalid argument”,
and the result differs from that of the original code without shared memory.

What is nbe? It matters. gride and blocke also matter for this error code. The most likely issue is that one of those three is invalid. Since you have not provided the actual value of any of them, I’m not sure how anyone could help you.

Please find the revised code below:

__global__ void PT1(double* dvxdx, double* dvydy, double* dvxdy, double* dvydx, double* vx, double* vy, double* alpha, double* beta, int* index, double* kvx, double* kvy, double* etan, double* Helem, double* areas, bool* isice, double* Eta_nbe, double* rheology_B, double n_glen, double eta_0, double rele, int nbe){

    extern __shared__ double s[];
    int t = threadIdx.x;
    int tr = nbe - t - 1;
    s[t] = etan[t];
    __syncthreads();

    etan[t] = s[tr];

    Eta_nbe[ix] = s[ix]*areas[ix];

}

int main() {

    for(int i=0;i<nbe;i++){
        etan[i] = 0;
    }

    double *d_etan;
    cudaMalloc(&d_etan, nbe*sizeof(double));
    cudaMemcpy(d_etan, etan, nbe*sizeof(double), cudaMemcpyHostToDevice);

    PT1 <<<gride, blocke, nbe*sizeof(double)>>>
    (dvxdx, dvydy, dvxdy, dvydx, d_vx, d_vy, d_alpha, d_beta, d_index, kvx, kvy, d_etan, d_Helem, d_areas, d_isice, Eta_nbe, d_rheology_B, n_glen, eta_0, rele, nbe);

    cudaDeviceSynchronize();

}

nbe = size of the array = 35479552, gride = 69280, blocke = 512
GPU architecture: Tesla V100 (compute capability 7.0)

ERROR launching GPU C-CUDA program: invalid argument

We followed the dynamic shared memory example given in this link:
https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

That is way too large, by orders of magnitude: with nbe = 35479552, the launch requests nbe*sizeof(double) = 283836416 bytes (about 271 MiB) of dynamic shared memory per block. If you read the article at the link you referenced in its entirety, you will get a rough idea of the amount of shared memory typically available (it can be somewhat more on modern GPUs, but not massively more). In appendix K (table 15) of the CUDA Programming Guide you can find the shared memory size of all currently supported GPU architectures.
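
To make the mismatch concrete, here is a minimal sketch that compares a dynamic shared memory request with the per-block limit the runtime reports. The helper names `requestedBytes` and `reportSharedMemoryFit` are made up for illustration, and device 0 is an assumption; `cudaGetDeviceProperties` and the `sharedMemPerBlock` field come from the CUDA runtime API.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes of dynamic shared memory needed for n doubles.
size_t requestedBytes(size_t n) {
    assert(n <= SIZE_MAX / sizeof(double));  // guard against overflow
    return n * sizeof(double);
}

// Hypothetical helper: prints the request next to the per-block limit of
// device 0 (assumed) and returns true if the request fits.
// Returns false if the device query itself fails.
bool reportSharedMemoryFit(size_t n) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n",
                    cudaGetErrorString(err));
        return false;
    }
    size_t need = requestedBytes(n);
    std::printf("requested %zu bytes, limit per block %zu bytes\n",
                need, prop.sharedMemPerBlock);
    return need <= prop.sharedMemPerBlock;
}
```

Calling `reportSharedMemoryFit(35479552)` with the nbe from this thread corresponds to a request of 283836416 bytes, so it should return false.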

The shared memory per block on our GPU architecture is 49152 bytes.

The parameters/arrays are of type double.

Would the maximum array size that can be transferred from global to shared memory be 6144, i.e. 49152 divided by 8?

And would the execution configuration look like this: PT1 <<<gride, blocke, 6144*sizeof(double)>>>?
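
On the arithmetic: 49152 / 8 = 6144, so 6144 doubles is the most a single block could hold under the 49152-byte limit quoted above. Note that the third launch parameter is the number of bytes per block, and the shared array is private to each block, so it is usually sized from the block dimensions rather than from nbe. The following is only a sketch of that pattern under stated assumptions: `tileKernel`, `tileBytes`, and `launchTile` are hypothetical names, and the kernel is a stripped-down stand-in for PT1 that keeps only the etan, areas, and Eta_nbe arrays.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for PT1: each block stages only its own
// blockDim.x elements of etan in shared memory.
__global__ void tileKernel(const double* etan, const double* areas,
                           double* Eta_nbe, int nbe) {
    extern __shared__ double tile[];  // blockDim.x doubles, sized at launch
    int t = threadIdx.x;
    // The loop bound depends only on blockIdx, so all threads of a block
    // run the same number of iterations and __syncthreads() is safe.
    for (int base = blockIdx.x * blockDim.x; base < nbe;
         base += blockDim.x * gridDim.x) {
        int ix = base + t;
        if (ix < nbe) tile[t] = etan[ix];   // index by thread, not by ix
        __syncthreads();
        if (ix < nbe) Eta_nbe[ix] = tile[t] * areas[ix];
        __syncthreads();                    // tile is reused next iteration
    }
}

// Hypothetical helper: dynamic shared memory per block, in bytes.
size_t tileBytes(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return static_cast<size_t>(threadsPerBlock) * sizeof(double);
}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t launchTile(const double* d_etan, const double* d_areas,
                       double* d_Eta_nbe, int nbe, int gride, int blocke) {
    tileKernel<<<gride, blocke, tileBytes(blocke)>>>(d_etan, d_areas,
                                                     d_Eta_nbe, nbe);
    cudaError_t err = cudaGetLastError();   // catches "invalid argument"
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

With blocke = 512 this requests 512 * 8 = 4096 bytes per block, well under the limit. In this simplified form each element is read once by the thread that loaded it, so shared memory brings no speedup here; the sketch only illustrates how the size and the indexing relate to the block.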