Not sure where feature requests are being collected these days, so I figured I’d post here in the hope that some NVIDIA folks are listening.
For CUBLAS, it’d be really nice to be able to batch up calls without separate streams (to avoid hitting Fermi’s 16-concurrent-kernel limit and stalling). For example, I’ve got multiple small matrices loaded and want to avoid the launch overhead of invoking a kernel for each one individually.
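To make the ask concrete, here is a rough sketch of what such a batched interface might look like. The name `cublasSgemmBatched` and its signature are my own illustration, not an existing CUBLAS call:

```cuda
// Hypothetical batched GEMM entry point (name and signature are my own
// sketch, not an existing CUBLAS API): one launch processes the whole batch.
// Aarray/Barray/Carray are device arrays of pointers, one per matrix in
// the batch; all matrices share the same dimensions.
cublasStatus_t cublasSgemmBatched(cublasHandle_t handle,
                                  cublasOperation_t transa,
                                  cublasOperation_t transb,
                                  int m, int n, int k,
                                  const float *alpha,
                                  const float *Aarray[], int lda,
                                  const float *Barray[], int ldb,
                                  const float *beta,
                                  float *Carray[], int ldc,
                                  int batchCount);
```

With something like this, a single kernel launch covers the whole batch, so there’s no need to juggle separate streams just to keep the GPU busy on many small matrices.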
For CUFFT, a fused fast-convolution call that takes the vector to multiply by (with the option to reuse it across batched calls) would save a lot of overhead: one launch instead of three. The pointwise-multiply kernel is only a couple of lines, so its launch overhead dominates its actual work.
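A possible shape for such an API, purely as illustration (these names and signatures are my own sketch, not existing CUFFT calls):

```cuda
// Hypothetical fused-convolution interface (my own sketch, not an existing
// CUFFT API). The filter spectrum is registered once on the plan, then
// reused across batched executions.
cufftResult cufftSetConvolutionFilter(cufftHandle plan,
                                      const cufftComplex *filterSpectrum);

// Performs forward FFT, pointwise multiply by the registered filter, and
// inverse FFT as one fused operation, avoiding a separate tiny multiply
// kernel launch between the two transforms.
cufftResult cufftExecC2CConvolve(cufftHandle plan,
                                 cufftComplex *idata,
                                 cufftComplex *odata);
```

Since the filter is set once per plan, batched convolutions against the same filter would pay the setup cost only once.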
Thanks.