[Feature request] more flexible cublas<T>gemmStridedBatched

Currently cublas<T>gemmStridedBatched requires both input matrices to have the same batch dimension. Say A has size (L*H)*m*k and B has size (L*H)*k*n; the result C then has size (L*H)*m*n. But on many occasions B has size H*k*n only, and we don't want to make many copies of B just so that it matches the batch dimension of A. This could easily be fixed with new arguments (sizes) giving the number of elements of A and B (or their maximum extents in memory) and changing the pointer computation loop to:
for (int p = 0; p < batchCount; ++p) {
    A_array[p] = A + (p*strideA) % sizeA;
    B_array[p] = B + (p*strideB) % sizeB;
    C_array[p] = C + p*strideC;
}
It would be great if CUDA came with this new feature. Lots of unnecessary copies would be avoided.
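To make the wrap-around concrete, here is a minimal host-side sketch of the offset computation proposed above. The names `BatchOffsets` and `wrappedOffsets` are hypothetical and not part of cuBLAS; `sizeA` and `sizeB` are the proposed new arguments (element counts of A and B). It only computes element offsets and does not call any cuBLAS routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a cuBLAS API: element offsets of the p-th
// A, B and C matrices under the proposed wrap-around rule.
struct BatchOffsets {
    std::vector<std::size_t> a, b, c;
};

BatchOffsets wrappedOffsets(int batchCount,
                            std::size_t strideA, std::size_t sizeA,
                            std::size_t strideB, std::size_t sizeB,
                            std::size_t strideC) {
    // sizeA and sizeB are total element counts and must be non-zero
    // for the modulo to be defined.
    assert(sizeA > 0 && sizeB > 0);
    BatchOffsets out;
    for (int p = 0; p < batchCount; ++p) {
        const std::size_t q = static_cast<std::size_t>(p);
        out.a.push_back((q * strideA) % sizeA);  // wraps if A holds fewer matrices
        out.b.push_back((q * strideB) % sizeB);  // wraps if B holds fewer matrices
        out.c.push_back(q * strideC);            // C always has batchCount matrices
    }
    return out;
}
```

For example, with L = 2, H = 3, m = 2, k = 4, n = 5, calling `wrappedOffsets(6, 8, 48, 20, 60, 10)` gives B offsets that cycle through the three stored matrices twice, while the A and C offsets keep advancing. Until something like this exists in the strided API, the same offsets could be added to the device base pointers to fill the pointer arrays that cublas<T>gemmBatched accepts, at the cost of building those arrays on the device.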

I recommend that feature requests be filed as bugs at developer.nvidia.com. Include the keyword RFE somewhere to indicate that it is a Request For Enhancement, not a defect report.