The API is documented, and the cuFFT documentation includes three code examples that show how to use cufftPlanMany() in three different scenarios.
Perhaps you are getting tripped up on the advanced data layout parameters. These can essentially be disregarded in a relatively simple scenario where the samples of each signal are contiguous and the signals are stored back to back in memory. We can also simplify things considerably for demonstration purposes by performing a C2C transform instead of an R2C transform. And as you point out, the averaging would not be done by cuFFT but by a kernel that you supply yourself. For that you might want to take a look at the cuFFT callback feature, but I would start by getting something basic working.
To that end, here is some sample code I put together that should help with understanding cufftPlanMany.
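In case it helps to see what the layout parameters mean, here is a small host-only sketch of the 1D indexing rule as I understand it from the cuFFT documentation: sample x of batch b is read from offset b*idist + x*istride. The helper name batched_index_1d is made up for illustration and is not part of cuFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuFFT function): element offset of sample x
   in batch b for a 1D batched layout with the given stride and distance. */
size_t batched_index_1d(size_t b, size_t x, size_t stride, size_t dist)
{
    assert(stride >= 1);          /* strides are counted in elements */
    return b * dist + x * stride;
}
```

With signals stored back to back (stride 1, dist equal to the signal length), batched_index_1d(2, 5, 1, 1024) gives 2053. If 32 signals were interleaved sample by sample instead, you would use stride 32 and dist 1, and the same sample would sit at offset 162.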
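#include <cuda_runtime.h>  // cudaMalloc, cudaMemcpy, cudaFree
#include <cufft.h>
#include <cuComplex.h>
#include <stdio.h>
#include <stdlib.h>        // malloc, free
#include <math.h>          // sin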
#define N_SIGS 32
#define SIG_LEN 1024
int main(){
    cuFloatComplex *h_signal, *d_signal, *h_result, *d_result;

    // Host buffers: N_SIGS signals of SIG_LEN samples each, stored back to back
    h_signal = (cuFloatComplex *)malloc(N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
    h_result = (cuFloatComplex *)malloc(N_SIGS*SIG_LEN*sizeof(cuFloatComplex));

    // Signal i is a sine wave with (i+1) cycles over the signal length
    for (int i = 0; i < N_SIGS; i++)
        for (int j = 0; j < SIG_LEN; j++)
            h_signal[(i*SIG_LEN) + j] = make_cuFloatComplex(sin((i+1)*6.283*j/SIG_LEN), 0);

    cudaMalloc(&d_signal, N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
    cudaMalloc(&d_result, N_SIGS*SIG_LEN*sizeof(cuFloatComplex));
    cudaMemcpy(d_signal, h_signal, N_SIGS*SIG_LEN*sizeof(cuFloatComplex), cudaMemcpyHostToDevice);

    // One plan for N_SIGS 1D transforms of length SIG_LEN
    cufftHandle plan;
    int n[1] = {SIG_LEN};
    cufftResult res = cufftPlanMany(&plan, 1, n,
                                    NULL, 1, SIG_LEN,  // inembed = NULL: advanced input layout off, stride/dist ignored
                                    NULL, 1, SIG_LEN,  // onembed = NULL: advanced output layout off, stride/dist ignored
                                    CUFFT_C2C, N_SIGS);
    if (res != CUFFT_SUCCESS) {printf("plan create fail\n"); return 1;}

    // A single call executes the whole batch
    res = cufftExecC2C(plan, d_signal, d_result, CUFFT_FORWARD);
    if (res != CUFFT_SUCCESS) {printf("forward transform fail\n"); return 1;}

    cudaMemcpy(h_result, d_result, N_SIGS*SIG_LEN*sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);

    // Print the real part of the first 10 bins of each transform
    // (for a real sine input, most of the energy is in the imaginary part)
    for (int i = 0; i < N_SIGS; i++){
        for (int j = 0; j < 10; j++)
            printf("%.3f ", cuCrealf(h_result[(i*SIG_LEN)+j]));
        printf("\n");
    }

    // Release the plan and the buffers
    cufftDestroy(plan);
    cudaFree(d_signal);
    cudaFree(d_result);
    free(h_signal);
    free(h_result);
    return 0;
}
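As for the averaging step, here is a rough host-side sketch of what your own code would have to compute, using the same back-to-back layout as h_result above. I am assuming that "averaging" means the per-bin mean of the power |X|^2 across all signals; if you want something else (for example the mean of the complex values), only the inner accumulation changes. The type complex_f is a stand-in for cuFloatComplex so the snippet compiles without CUDA, and average_power is a made-up name, not a cuFFT function. A device kernel would use the same indexing with one thread per bin.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cuFloatComplex: x is the real part, y the imaginary part. */
typedef struct { float x, y; } complex_f;

/* Hypothetical helper: avg_power[k] = mean over all signals of |X_i[k]|^2,
   where spectrum i starts at spectra[i * sig_len]. */
void average_power(const complex_f *spectra, size_t n_sigs, size_t sig_len,
                   float *avg_power)
{
    assert(spectra != NULL && avg_power != NULL && n_sigs > 0);
    for (size_t k = 0; k < sig_len; k++) {
        float sum = 0.0f;
        for (size_t i = 0; i < n_sigs; i++) {
            complex_f v = spectra[i * sig_len + k];
            sum += v.x * v.x + v.y * v.y;   /* squared magnitude of bin k */
        }
        avg_power[k] = sum / (float)n_sigs;
    }
}
```

With the sample above you would call it as average_power((const complex_f *)h_result, N_SIGS, SIG_LEN, avg), where avg is a float array of SIG_LEN elements that you allocate yourself.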