Reducing memory footprint in a parallelized function

Currently I am trying to rewrite a function like this:

float intermediate_val[100];
for (int i = 0; i < N; ++i) {
  ...
  write_values(intermediate_val);  // fill the scratch array
  use_values(intermediate_val);    // consume it within the same iteration
  ...
}

in CUDA. That is, there is a loop that can easily be parallelized, but each iteration writes to, and then reads from, the array "intermediate_val". No iteration depends on the values of intermediate_val from any other iteration.

The most obvious implementation is to allocate enough memory to give every instance of the loop its own copy of intermediate_val:

float *vals_gpu;
cudaMalloc(&vals_gpu, N * 100 * sizeof(float));
kernel_loop<<<mesh,block>>>(..., vals_gpu, ...);

and then pass the pointer to the kernel implementing the loop; inside the kernel, each thread calculates its own offset and uses its private block of 100 floats:

__global__ void kernel_loop(..., float *vals_gpu, ...) {
    int offset = calculate_offset(blockIdx, blockDim, threadIdx);
    float *intermediate_val = vals_gpu + offset;  // this thread's private 100-float block
    ...
    write_values_gpu(intermediate_val);
    use_values_gpu(intermediate_val);
    ...
}
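
For a 1-D launch the offset helper boils down to the global thread index times the per-thread block size. A minimal sketch, assuming one thread per loop instance (the helper name and signature here just mirror the call above and are illustrative):

__device__ int calculate_offset(uint3 bid, dim3 bdim, uint3 tid) {
    int t = bid.x * bdim.x + tid.x;  // global 1-D thread index
    return t * 100;                  // each thread owns 100 consecutive floats
}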

But this means I need to allocate a huge array of N*100 floats before calling the kernel, and for large N the GPU runs out of memory (for example, N = 10^7 already requires 10^7 * 100 * 4 bytes = 4 GB).

Is there any way to avoid this? For example, is it possible for each thread to allocate its own float[100] array inside the kernel and free it when it is done?
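
Conceptually, what I have in mind is something like the sketch below. I know CUDA supports device-side malloc/free (on compute capability 2.0 and later), but the device heap is small by default, so the host would presumably have to raise it first with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...). I am not sure whether this is efficient or the recommended pattern, which is essentially my question:

__global__ void kernel_loop(...) {
    // Each thread allocates its own scratch block from the device heap
    // (device-side malloc/free require compute capability >= 2.0).
    float *intermediate_val = (float *)malloc(100 * sizeof(float));
    if (intermediate_val == NULL)
        return;  // device heap exhausted
    ...
    write_values_gpu(intermediate_val);
    use_values_gpu(intermediate_val);
    ...
    free(intermediate_val);  // release the block as soon as it is no longer needed
}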