Maximum concurrent kernels for number of streams > 16

Hi,

From the reference document (page 11) for CUBLAS,

Does this mean that batches of 16 kernels will be executed in series until all 1024 streams are done?

In other words, will 1024/16 = 64 batches be computed one after the other?

This assumes that all the data for all the small matrices is transferred to the GPU at once.
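
To make the arithmetic in my question concrete, here is a small sketch. It assumes that the limit really is 16 concurrent kernels and that the kernels run in strict groups, each group finishing before the next one starts. That is exactly the behaviour I am unsure about, so this is only my reading, not confirmed behaviour. The name batches_needed is made up for illustration.

```cpp
// Hypothetical helper: number of sequential groups needed if at most
// `max_concurrent` kernels run at once and every group must finish
// before the next one starts (an assumption, not confirmed behaviour).
constexpr int batches_needed(int num_streams, int max_concurrent) {
    // Ceiling division, so a partial final group still counts as one.
    return (num_streams + max_concurrent - 1) / max_concurrent;
}
```

With 1024 streams and a limit of 16, batches_needed(1024, 16) gives 64. If the GPU instead starts a new kernel as soon as any running kernel finishes, this count would only be an upper bound on the number of "waves".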