Bottleneck of dynamic parallelism in CUDA: global memory allocations to exchange data with child kernels

I have the following code:

__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    double phi_cap1, phi_cap2;

    if(i<M) {

        for(int m=0; m<(2*K+1); m++) {

            [calculate phi_cap1];

            for(int n=0; n<(2*K+1); n++) {

                [calculate phi_cap2];

                [calculate phi_cap=phi_cap1*phi_cap2];

                [use phi_cap];

            }
        }

    }

}

I would like to use dynamic parallelism on a Kepler K20 card to dispatch the computation of phi_cap1 and phi_cap2 to a group of threads running in parallel, in order to reduce the computation time. K=6 in my code, so each child launch is a single block of 13x13 threads.

Following the CUDA Dynamic Parallelism Programming Guide, I'm allocating in global memory a matrix phi_cap of 169 elements (the products of phi_cap1 and phi_cap2), which is needed to exchange data with the child kernel. Indeed, quoting the guide,

"As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap."

I then ended up with the following code:

[code]
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    dim3 dimBlock(2*K+1,2*K+1); dim3 dimGrid(1,1);

    if(i<M) {

        double* phi_cap; cudaMalloc((void**)&phi_cap,sizeof(double)*(2*K+1)*(2*K+1));

        child_kernel<<<dimGrid,dimBlock>>>(cc_diff1,cc_diff2,phi_cap);

        for(int m=0; m<(2*K+1); m++) {

            for(int n=0; n<(2*K+1); n++) {

                [use phi_cap];

            }
        }

    }
}
[/code]

The problem is that the first routine takes 5 ms to run, while the second takes 23 ms even with the child_kernel launch commented out, and practically all of that time is spent in the cudaMalloc call.
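A side note on the in-kernel cudaMalloc, beyond its cost: when called from device code it maps to the device-side malloc(), so it draws from the device heap, whose default size is 8 MB. Each thread here requests 1352 bytes (169 doubles) and the code above never calls cudaFree, so a large M can exhaust the heap and make allocations fail. Below is a minimal host-side sketch of enlarging the heap before the first launch. The helper names scratchBytesPerThread and reserveDeviceHeap and the headroom factor are assumptions made for illustration; they are not part of the original code.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Bytes one parent thread requests for its (2K+1) x (2K+1) table of doubles.
// Hypothetical helper, introduced only for this sketch.
inline size_t scratchBytesPerThread(int K)
{
    const size_t side = 2 * static_cast<size_t>(K) + 1;
    return sizeof(double) * side * side;
}

// Host side: enlarge the device heap used by in-kernel malloc()/cudaMalloc().
// Call this before launching any kernel that allocates from the heap.
// M is the number of parent threads; headroom is an assumed safety factor
// to leave room for allocator overhead.
inline cudaError_t reserveDeviceHeap(size_t M, int K, size_t headroom = 2)
{
    return cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                              headroom * M * scratchBytesPerThread(K));
}
```

This only addresses heap capacity. It should not be expected to reduce the time spent in the allocations themselves.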

With dynamic parallelism one often needs to allocate memory to exchange data with child kernels, and the only option seems to be global memory, whose allocation takes this long. It therefore seems to me that data exchange is a serious bottleneck for the usefulness of dynamic parallelism, unless there is a way to circumvent the global memory allocation issue.

I would be grateful if someone could comment on the above observation and tell me about any workaround. Thanks
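One possible way around the per-thread allocation, sketched here under stated assumptions rather than taken from the original code, is to allocate a single scratch buffer from the host and let every parent thread use its own 169-element slice. The child_kernel body, the double type of cc_diff1 and cc_diff2, the values passed for them, the reduced argument list of interpolation, and the runInterpolation wrapper are all placeholders, since the real computations are elided in the question.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

#define K     6
#define SIDE  (2 * K + 1)        // 13
#define TABLE (SIDE * SIDE)      // 169 doubles per parent thread

// Placeholder child: one 13x13 block fills one table.
// The product below is a stand-in for the elided phi_cap1 * phi_cap2.
__global__ void child_kernel(double cc_diff1, double cc_diff2, double* phi_cap)
{
    const int m = threadIdx.y, n = threadIdx.x;
    phi_cap[m * SIDE + n] = (cc_diff1 + m) * (cc_diff2 + n);
}

// Parent with a reduced, hypothetical argument list.
__global__ void interpolation(double* scratch, double* result, int M)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= M) return;

    // This thread's slice of the preallocated buffer: no in-kernel cudaMalloc.
    double* phi_cap = scratch + (size_t)i * TABLE;

    child_kernel<<<dim3(1, 1), dim3(SIDE, SIDE)>>>(1.0 * i, 2.0 * i, phi_cap);
    // Device-side synchronization from the original dynamic parallelism
    // model (as available for the K20); later CUDA releases deprecate it.
    cudaDeviceSynchronize();

    double acc = 0.0;                       // stand-in for "[use phi_cap]"
    for (int k = 0; k < TABLE; ++k) acc += phi_cap[k];
    result[i] = acc;
}

// Host wrapper: one cudaMalloc for all parent threads instead of one each.
cudaError_t runInterpolation(double* h_result, int M)
{
    double *scratch = NULL, *result = NULL;
    cudaError_t err = cudaMalloc((void**)&scratch, sizeof(double) * (size_t)M * TABLE);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void**)&result, sizeof(double) * (size_t)M);
    if (err != cudaSuccess) { cudaFree(scratch); return err; }

    const int threads = 128;
    interpolation<<<(M + threads - 1) / threads, threads>>>(scratch, result, M);
    err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_result, result, sizeof(double) * (size_t)M,
                         cudaMemcpyDeviceToHost);
    cudaFree(scratch);
    cudaFree(result);
    return err;
}
```

Dynamic parallelism needs relocatable device code, for example `nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt`. The scratch buffer costs M * 1352 bytes, so for very large M it may be necessary to process the points in batches and reuse the buffer.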

Please take a look at the discussion at

http://stackoverflow.com/questions/14855408/bottlneck-of-dynamic-programming-in-cuda-global-memory-allocations-to-exchange

For an interpolation problem, I have finally improved my results using dynamic parallelism.