Dynamic parallelism vs Streams

Suppose we need to perform 1000 matrix-vector multiplications completely asynchronously. One can use cuBLAS with streams, but only 16 such kernels will actually run concurrently (or 32 on devices of compute capability 3.5). This is an important constraint, and I am wondering whether it can be overcome with dynamic parallelism, i.e., by writing a kernel that calls cuBLAS functions and is executed on a grid in which each thread launches a new grid by invoking a "child kernel". The overall code I have in mind would be something like this:

__global__ void my_wrapper(float *data, int *sizes){
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  // For each i, launch a new (child) grid:
  // call cuBLAS or some custom kernel here
}

int main(){
  my_wrapper<<<M,N>>>(data, sizes);
}

My question is whether there are any strict hardware limitations that apply in the case of dynamic parallelism. Can one generate 1000 such kernels to perform the asynchronous mat-vec operations? Is there some other way to perform cuBLAS operations in parallel overcoming the limitation of 16 parallel kernels? What would be the best practice in such a case?
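For concreteness, here is a minimal sketch of the kind of device-side launch I mean. The child kernel `child_matvec` is a hypothetical naive mat-vec stand-in for the cuBLAS call, and the pointer-array layout (`A[i]`, `x[i]`, `y[i]`) is just one possible way to pass 1000 independent problems:

```cuda
// Hypothetical child kernel: naive y = A*x for one n x n matrix.
__global__ void child_matvec(const float *A, const float *x, float *y, int n){
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < n){
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[row * n + j] * x[j];
        y[row] = acc;
    }
}

// Parent kernel: each thread launches one child grid for its own problem.
__global__ void my_wrapper(float **A, float **x, float **y,
                           const int *sizes, int count){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < count){
        int n = sizes[i];
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        // Device-side kernel launch (dynamic parallelism).
        child_matvec<<<blocks, threads>>>(A[i], x[i], y[i], n);
    }
}
```

This only compiles for compute capability 3.5 or higher with relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt`. Note also that the device runtime has a pending-launch buffer limit (adjustable via `cudaDeviceSetLimit` with `cudaLimitDevRuntimePendingLaunchCount`), which is relevant if many parent threads launch children at once.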

What a tumbleweed! Anyway, dynamic parallelism is available only on devices of c.c. 3.5 or higher (not my case!)