Passing local variables from a parent kernel to a child kernel with dynamic parallelism in CUDA

I am trying to use dynamic parallelism in CUDA. My parent kernel has a variable that needs to be passed to the child kernel for further computation. I have gone through the resources on the web here

and they mention that local variables cannot be passed to the child kernel, and describe the ways to pass variables. I have tried to pass the variable as follows:

#include <stdio.h>
#include <cuda.h>

__global__ void square(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  a[idx] = a[idx] * a[idx];
}

// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int n=N; // this value of n can be changed locally and need to be passed
  cudaMalloc((void **) &n, sizeof(int));

  square <<< 1, N >>> (arr, n);
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:

  first <<< 1, 1 >>> (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
}

but the value is not passed from the parent kernel to the child kernel. How can I pass the value of a local variable? Is there any way to do so?

either pass it by value as a kernel parameter (without the cudaMalloc((void **) &n, sizeof(int)) call)

or move it to global memory, and pass the child a pointer to that global memory location
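
a minimal sketch of the first option, assuming a device of compute capability 3.5 or higher and a build with relocatable device code (-rdc=true, linked against cudadevrt). The local n is copied into the child's argument list at launch, so no allocation is needed. The "N - 2" adjustment and the run_by_value helper are hypothetical, added only so the example is self-contained; the bounds check in square is also an addition.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Child kernel: receives n by value and squares the first n elements.
__global__ void square(float *a, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * a[idx];
}

// Parent kernel: n is an ordinary local variable. Its value (not its
// address) is passed to the child, which is allowed.
__global__ void first(float *arr, int N)
{
  int n = N - 2;  // hypothetical local change, for illustration only
  square<<<1, N>>>(arr, n);
}

// Hypothetical host helper: fills a_h with 0..N-1, runs the parent
// kernel, and copies the result back. Returns 0 on success.
int run_by_value(float *a_h, int N)
{
  float *a_d;
  size_t size = N * sizeof(float);
  for (int i = 0; i < N; i++) a_h[i] = (float)i;
  if (cudaMalloc((void **)&a_d, size) != cudaSuccess) return 1;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  first<<<1, 1>>>(a_d, N);
  // The blocking copy waits for the parent grid, which does not
  // complete until its child grid has completed.
  cudaError_t err = cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
  const int N = 10;
  float a_h[N];
  if (run_by_value(a_h, N) != 0) return 1;
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
  return 0;
}
```

what must not be passed to the child is a pointer to local memory, such as &n; passing the value itself is fine.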

is it possible to pass the local variable from the parent kernel as

square <<< 1, N >>> (arr, n);

where n is a local variable?

the question is whether you are essentially referring to a variable, or to an array

as i understand it, you launch the parent with a block and grid dimension of 1, which simplifies a lot, and really makes it a single variable, not an array

should you wish to pass an array, you would have to go the global memory route
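then what troubles me is that you pass the address of a local variable to cudaMalloc, which can overwrite n with a pointer value instead of leaving it equal to N, and seemingly attempt to hand the child a pointer to local memory, which raises red flags in my mind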
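
a rough sketch of the global memory route for an array, under the same build assumptions (compute capability 3.5 or higher, -rdc=true, linked against cudadevrt). The names scale, MAX_N, apply, parent and run_global_array are all hypothetical, as is the factor of 2; the point is only that the array lives in a __device__ variable, so the pointer handed to the child refers to global memory rather than to the parent's local memory.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define MAX_N 32  // hypothetical capacity of the global scratch array

// Lives in global memory, so its address may be passed to a child kernel.
__device__ float scale[MAX_N];

// Child kernel: multiplies each element by the matching factor.
__global__ void apply(float *a, const float *factors, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * factors[idx];
}

// Parent kernel: fills the global array, then launches the child.
// A local array (float tmp[MAX_N];) must not be passed instead,
// because the child cannot dereference the parent's local memory.
__global__ void parent(float *arr, int n)
{
  if (n > MAX_N) return;
  for (int i = 0; i < n; i++) scale[i] = 2.0f;
  // Global memory writes made before the launch are visible to the child.
  apply<<<1, n>>>(arr, scale, n);
}

// Hypothetical host helper: fills a_h with 0..N-1, runs the parent
// kernel, and copies the result back. Returns 0 on success.
int run_global_array(float *a_h, int N)
{
  float *a_d;
  size_t size = N * sizeof(float);
  for (int i = 0; i < N; i++) a_h[i] = (float)i;
  if (cudaMalloc((void **)&a_d, size) != cudaSuccess) return 1;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  parent<<<1, 1>>>(a_d, N);
  cudaError_t err = cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
  const int N = 10;
  float a_h[N];
  if (run_global_array(a_h, N) != 0) return 1;
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
  return 0;
}
```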