Let us assume we need to perform 1000 matrix-vector multiplications, all launched asynchronously. One can use cuBLAS and streams, but only 16 such kernels will actually run concurrently (32 on devices with compute capability 3.5). This is an important constraint, and I am wondering whether it can be overcome with dynamic parallelism, *i.e.*, by writing a kernel that is executed on a grid in which each thread launches a "child kernel" (either a cuBLAS call or a custom kernel). The overall code I have in mind would be something like this:

```
__global__ void my_wrapper(float *data, int *sizes)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // For each i, launch a child grid:
    // call cuBLAS (device API) or some custom kernel here
}

int main()
{
    my_wrapper<<<M, N>>>(data, sizes);
    cudaDeviceSynchronize();
    return 0;
}
```

My question is whether any strict hardware limitations apply in the case of dynamic parallelism. Can one launch 1000 such child kernels to perform the asynchronous mat-vec operations? Is there some other way to run cuBLAS operations in parallel that overcomes the 16-concurrent-kernel limit? What would be the best practice in such a case?
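For concreteness, here is a minimal sketch of the dynamic-parallelism variant I have in mind, using a hypothetical custom mat-vec child kernel instead of a cuBLAS call (the kernel names, the packed-layout assumption `A[i]` at offset `i*n*n`, and the launch sizes are placeholders). Each parent thread creates its own non-blocking device stream so that child grids can overlap, subject to whatever limits the hardware imposes. It would need compute capability ≥ 3.5 and compilation with `-rdc=true -lcudadevrt`:

```
// Child kernel: naive mat-vec y = A * x for one n x n matrix.
__global__ void matvec_child(const float *A, const float *x, float *y, int n)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < n) {
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];
        y[row] = sum;
    }
}

// Parent kernel: each thread launches one child grid on its own
// non-blocking stream (required on the device for overlap).
__global__ void my_wrapper(const float *A, const float *x, float *y,
                           int n, int num_problems)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < num_problems) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        int threads = 128;
        int blocks  = (n + threads - 1) / threads;
        // Assumes all matrices/vectors are packed contiguously.
        matvec_child<<<blocks, threads, 0, s>>>(
            A + i * n * n, x + i * n, y + i * n, n);
        cudaStreamDestroy(s);
    }
}
```

Whether 1000 such child grids would actually execute concurrently, rather than being queued, is exactly what I am unsure about.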