I am currently using Intel tools (MKL and TBB), and I am considering moving to CUDA because the performance isn’t good enough.
My app mostly makes a large number of BLAS calls on small matrices (up to about 100x100). With Intel’s tools, I make single-threaded MKL calls and use TBB to issue them from multiple threads. This utilizes the CPU better than MKL’s default multithreaded mode, which doesn’t help here because the matrices are too small.
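To make the current setup concrete, here is a minimal sketch of the pattern, not my actual code. It assumes the multiplications are independent of each other. The names `Matrix`, `small_gemm` and `multiply_all` are hypothetical; the naive triple loop stands in for one single-threaded MKL call, and plain `std::thread` with a static split stands in for TBB, so that the sketch builds without either library.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Square, row-major n x n matrix (hypothetical helper type for this sketch).
struct Matrix {
    std::size_t n;
    std::vector<double> data;
    explicit Matrix(std::size_t n_, double fill = 0.0)
        : n(n_), data(n_ * n_, fill) {}
    double& at(std::size_t i, std::size_t j) { return data[i * n + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};

// Stand-in for one single-threaded BLAS call (such as dgemm): c = a * b.
// Assumes a, b and c all have the same size.
void small_gemm(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.n;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a.at(i, k) * b.at(k, j);
            }
            c.at(i, j) = sum;
        }
    }
}

// Computes c[i] = a[i] * b[i] for every i. The list of products is split
// into contiguous chunks, and each worker thread handles one chunk, so the
// parallelism comes from running many small calls at once rather than from
// threading inside a single call.
void multiply_all(const std::vector<Matrix>& a,
                  const std::vector<Matrix>& b,
                  std::vector<Matrix>& c,
                  unsigned num_threads) {
    const std::size_t count = a.size();
    num_threads = std::max(1u, num_threads);
    const std::size_t chunk = (count + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // no work left for further threads
        workers.emplace_back([&a, &b, &c, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                small_gemm(a[i], b[i], c[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

For example, calling `multiply_all(a, b, c, std::thread::hardware_concurrency())` with a few thousand 100x100 matrices keeps every core busy even though each individual product is too small to be worth threading internally.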
Is there a way to do something similar with cuBLAS? I am new to CUDA and not really familiar with the terminology, but I guess this would mean running multiple streams/kernels and having each stream/kernel issue multiple BLAS calls (I may be wrong, of course).