I am trying to improve one aspect of a CUDA program that I run on a Tesla K40 GPU:

In my original CUDA implementation, I launch a 2D grid of 16x16 blocks. In the algorithm, each thread computes (1) an initial value and (2) fine-grained results based on (1). The problem is that (1) is actually the same for all threads within a block, so it is recomputed redundantly by every thread.

My first attempt at optimizing this was to have only the first thread of each block compute the common value and store it in shared memory, followed by a __syncthreads() call. After the barrier, all threads do (2) using the initial value just as before, with the advantage that it is computed only once per block. I expected this to run faster, but in the end the performance was the same as the original solution, so I discarded it and returned to the original.
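For reference, this is roughly what the shared-memory version looked like. It is only a minimal sketch: `compute_common_val` and `do_computations` here are hypothetical stand-ins for my real functions, and the output buffer, the `width` parameter and the 45 x 45 grid size in `main` are assumptions made just so the example is self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for step (1): a value shared by the whole block.
__device__ int compute_common_val(int block_id) { return block_id * 2; }

// Hypothetical stand-in for step (2): fine-grained per-thread work.
__device__ int do_computations(int common_val, int thread_id) {
    return common_val + thread_id;
}

__global__ void block_shared_kernel(int *out, int width) {
    __shared__ int common_val;

    // Only the first thread of the block computes the common value.
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        common_val = compute_common_val(blockIdx.y * gridDim.x + blockIdx.x);
    }
    // Barrier: every thread waits until the shared value has been written.
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int local_id = threadIdx.y * blockDim.x + threadIdx.x;
    out[y * width + x] = do_computations(common_val, local_id);
}

int main() {
    const int blocks = 45;            // assumed grid size per dimension
    const int tile = 16;              // 16x16 threads per block
    const int width = blocks * tile;  // the grid covers the domain exactly

    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int) * width * width);

    dim3 dimblock(tile, tile, 1);
    dim3 dimgrid(blocks, blocks, 1);
    block_shared_kernel<<<dimgrid, dimblock>>>(d_out, width);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Note that while thread (0,0) computes the value, the other 255 threads of the block simply wait at the barrier, so the computation of (1) still sits on the critical path of every block.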

Since I recently got a Tesla K40, I thought it was the perfect moment to try the attractive dynamic parallelism feature. What I did was first launch a coarse kernel, where each thread represents a block of the original CUDA implementation. The idea is to compute the initial value in this coarse kernel, and then have each thread launch a child kernel of 16x16 threads, passing the common value as an argument. On paper this sounded like a good optimization, but in practice I get horrible performance, up to 20x slower.

This is how I intended to use dynamic parallelism:

```
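// Child kernel: a single 16x16 block that reuses the value computed by its parent thread.
__global__ void child_kernel( int common_val, parameters ){
    // use common value
    do_computations(common_val, parameters);
}

// Coarse kernel: each thread stands in for one block of the original implementation.
__global__ void coarse_kernel( parameters ){
    int common_val = compute_common_val();
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    // device-side launch (needs compute capability 3.5+ and compiling with -rdc=true)
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);
}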
```

A realistic case, for example, would be a 45 x 45 coarse grid of 16x16 blocks, where each thread spawns a new child kernel of just 16x16 threads. As you can see, there will be a very large number of child kernels.
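To put a number on it, assuming the coarse grid really is 45 x 45 blocks of 16 x 16 threads and every coarse thread performs exactly one child launch, the count works out as:

```latex
\begin{aligned}
N_{\text{child launches}} &= (45 \times 45) \times (16 \times 16) = 2025 \times 256 = 518{,}400 \\
N_{\text{child threads}}  &= 518{,}400 \times (16 \times 16) = 132{,}710{,}400
\end{aligned}
```

So under that assumption, each of the roughly half a million child launches runs a single block of only 256 threads.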

So in the end I just went back to the original CUDA implementation, but it is disappointing that neither of the ideas above worked in practice.

What kind of performance problem or limitation am I facing here? Does anyone know what approach I should take?