jam11
Hi,
From the reference document (page 11) for CUBLAS,
Does this mean that batches of 16 kernels will be executed in series until all 1024 streams are done?
In other words, will the 1024/16 = 64 batches be calculated one after the other?
This assumes that all the data for all the small matrices is transferred to the GPU at once.