Someone has probably asked this before, so I apologize if this topic is a repeat.
I’m writing a CUDA program that performs many multiplications of small matrices and vectors.
Each operation has the following structure:
(10 x 100) x (100 x 100) x (100 x 10) x (10 x 100) x (100 x 100) x (100)
where (10 x 100) is a 10 x 100 matrix and (100) is a 100-element vector.
I will have to perform around 50,000 such operations, so I’m trying to choose the right structure for such a program…
I was thinking of one of two options.
One is to have 1 block do 1 operation (launch 1 kernel with 50,000 blocks).
The biggest problem is that each stage of this operation needs a different number of threads:
1: (10 x 100) x (100 x 10) = (10 x 10)
so I’ll need only 10x10 threads here
2: (10 x 10) x (10 x 100) = (10 x 100)
100x10 threads are needed here
3: (10 x 100) x (100)
only 10 threads are needed here.
I can’t figure out how many threads I should allocate. If I allocate 100x10, most of them will sit idle while a few do the actual work.
Alternatively, if I allocate, say, 10x10 threads, there isn’t much parallelism and I’ll have serious problems with coalescing.
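To make the one-block-per-operation idea concrete, here is a minimal sketch in which the block size is fixed and every stage uses a block-stride loop, so a stage with 100 outputs and a stage with 1000 outputs can share the same block without matching the thread count to either. It covers only the three stages listed above. The names chainKernel, runChain, S, N and THREADS, the row-major contiguous layout, and the block size of 128 are my own assumptions and are not tuned; error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int S = 10;         // small dimension
constexpr int N = 100;        // large dimension
constexpr int THREADS = 128;  // assumed block size, not tuned

// One block handles one operation. For operation `op` the inputs are stored
// contiguously and row-major: A is S x N, B is N x S, D is S x N, v has N elements.
__global__ void chainKernel(const float* A, const float* B, const float* D,
                            const float* v, float* y)
{
    __shared__ float c1[S * S];  // stage 1 result (S x S)
    __shared__ float c2[S * N];  // stage 2 result (S x N)

    const size_t op = blockIdx.x;
    const float* a = A + op * S * N;
    const float* b = B + op * N * S;
    const float* d = D + op * S * N;
    const float* x = v + op * N;

    // Stage 1: (S x N) x (N x S), 100 outputs spread over the block.
    for (int idx = threadIdx.x; idx < S * S; idx += blockDim.x) {
        int r = idx / S, c = idx % S;
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += a[r * N + k] * b[k * S + c];
        c1[idx] = acc;
    }
    __syncthreads();  // stage 2 reads all of c1

    // Stage 2: (S x S) x (S x N), 1000 outputs, several per thread.
    for (int idx = threadIdx.x; idx < S * N; idx += blockDim.x) {
        int r = idx / N, c = idx % N;
        float acc = 0.0f;
        for (int k = 0; k < S; ++k) acc += c1[r * S + k] * d[k * N + c];
        c2[idx] = acc;
    }
    __syncthreads();  // stage 3 reads all of c2

    // Stage 3: (S x N) x (N), only 10 outputs; the other threads skip the loop.
    for (int idx = threadIdx.x; idx < S; idx += blockDim.x) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += c2[idx * N + k] * x[k];
        y[op * S + idx] = acc;
    }
}

// Hypothetical host wrapper: copies inputs, launches one block per operation,
// and returns numOps * S results.
std::vector<float> runChain(const std::vector<float>& A, const std::vector<float>& B,
                            const std::vector<float>& D, const std::vector<float>& v,
                            int numOps)
{
    float *dA, *dB, *dD, *dV, *dY;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dD, D.size() * sizeof(float));
    cudaMalloc(&dV, v.size() * sizeof(float));
    cudaMalloc(&dY, (size_t)numOps * S * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dD, D.data(), D.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);

    chainKernel<<<numOps, THREADS>>>(dA, dB, dD, dV, dY);

    std::vector<float> y((size_t)numOps * S);
    cudaMemcpy(y.data(), dY, y.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dD); cudaFree(dV); cudaFree(dY);
    return y;
}
```

With this layout some threads are still idle in the first and last stages, but the intermediates stay in shared memory (about 4.4 KB per block) and the thread count no longer has to match any single stage.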
The alternative is to have 1 kernel perform 1 operation (launch 50,000 concurrent kernels and possibly save some time with asynchronous data copies).
This topic suggests that I can benefit from concurrent kernels if my block size is small (which is the case for me).
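For comparison, this is roughly what I have in mind for the one-kernel-per-operation variant: a small pool of streams, with each operation's copies and kernel launch queued on one of them in round-robin order. It is only a sketch under my own assumptions. The kernel is reduced to the final (10 x 100) x (100) product to keep it short, and matVec, runWithStreams, ROWS, COLS and the stream count are hypothetical names and values. As far as I understand, each launch carries its own overhead and the hardware limits how many kernels actually run at once, so I do not expect 50,000 kernels to be truly concurrent.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int ROWS = 10;
constexpr int COLS = 100;

// y = M * x for a single operation; one thread per output row.
__global__ void matVec(const float* M, const float* x, float* y)
{
    int r = threadIdx.x;
    if (r < ROWS) {
        float acc = 0.0f;
        for (int k = 0; k < COLS; ++k) acc += M[r * COLS + k] * x[k];
        y[r] = acc;
    }
}

// Hypothetical host wrapper: round-robins the operations over a stream pool.
std::vector<float> runWithStreams(const std::vector<float>& M,
                                  const std::vector<float>& x,
                                  int numOps, int numStreams)
{
    float *dM, *dX, *dY;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dX, x.size() * sizeof(float));
    cudaMalloc(&dY, (size_t)numOps * ROWS * sizeof(float));

    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    std::vector<float> y((size_t)numOps * ROWS);
    for (int op = 0; op < numOps; ++op) {
        cudaStream_t s = streams[op % numStreams];
        size_t mOff = (size_t)op * ROWS * COLS;
        size_t xOff = (size_t)op * COLS;
        size_t yOff = (size_t)op * ROWS;
        // The host buffers here are pageable, so these copies will not really
        // overlap with kernels; pinned memory (cudaMallocHost) is needed for that.
        cudaMemcpyAsync(dM + mOff, M.data() + mOff, ROWS * COLS * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dX + xOff, x.data() + xOff, COLS * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        matVec<<<1, 32, 0, s>>>(dM + mOff, dX + xOff, dY + yOff);
        cudaMemcpyAsync(y.data() + yOff, dY + yOff, ROWS * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();  // wait for all streams before reading y

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(dM); cudaFree(dX); cudaFree(dY);
    return y;
}
```

Work queued on the same stream runs in order, so each operation's copy-in, kernel and copy-out stay correctly sequenced, while operations on different streams are free to overlap if the device allows it.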
Which of these approaches do you think would work best for my case?