Some guidance on an optimal approach to batched matrix multiply

Hello learned Developers,

I am looking at leveraging the compute power of GPUs for a specific problem and have been reading a ton of information, from Pro-Tips on CUTLASS to cuBLAS examples, video tutorials, and the programming best-practices guide. There is a wealth of information out there, to the point that I am now overwhelmed about how to tackle the task I have. Every time I think I have found the best way forward, considerations specific to my use case raise doubts. I will attempt to describe the problem I am trying to solve.

  1. I have a 1 x 50 matrix (vector) that is calculated by the CPU once and can be considered a constant by the GPU.
  2. I have a 10 x 92378 matrix that is also a constant to be accessed by the GPU.
  3. I need to gather a 1 x 10 subset of the 1 x 50 vector and multiply it by the 10 x 92378 matrix, giving me a 1 x 92378 answer that requires some further processing (see point 5 below). I have an algorithm that calculates the 10 indices (pointers) for any and all combinations. A sketch of how I picture this multiply step is included just after this list.
  4. There are 50 choose 10 unique ways to select the 1 x 10 subset from the 1 x 50 vector, i.e. 10,272,278,170 different combinations! Each of these needs to be multiplied by the (constant) 10 x 92378 matrix.
  5. After calculating the 1 x 92378 answer for each combination, I need to sort the vector, noting the original index of each answer and the combination that produced it. Here is a simplified example: combination 3096 produced the answer vector [4,6,3,5,1]. The sorted vector (value, original index, combination) would look like this:
    [6,1,3096
    5,3,3096
    4,0,3096
    3,2,3096
    1,4,3096]
  6. I can then remove answers that are less than a certain value, say 4 in my example, leaving a reduced vector (a sketch of how I picture this sort-and-filter step also follows the list):
    [6,1,3096
    5,3,3096
    4,0,3096]
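
To make point 3 concrete, this is roughly how I picture a hand-rolled kernel for one batch of combinations. It is only a sketch to illustrate my thinking: the names, the row-major layout of the 10 x 92378 matrix, and the idea of passing the pre-computed index lists as a flat array are all my assumptions, not a settled design.

```cpp
#include <cuda_runtime.h>

#define N_VEC  50      // length of the constant 1 x 50 vector
#define K_SUB  10      // size of each subset
#define N_COLS 92378   // columns of the constant 10 x 92378 matrix B

// The 1 x 50 vector easily fits in constant memory.
__constant__ float d_vec[N_VEC];

// B is assumed row-major: element (k, j) lives at B[k * N_COLS + j].
// combos holds K_SUB indices per combination: combos[c * K_SUB + k].
// out holds one 1 x 92378 answer per combination in the current batch.
__global__ void subset_times_matrix(const float* __restrict__ B,
                                    const int*   __restrict__ combos,
                                    float*       __restrict__ out,
                                    int batchSize)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int c = blockIdx.y;                              // combination within the batch
    if (j >= N_COLS || c >= batchSize) return;

    // Each output element is a length-10 dot product between the gathered
    // subset of the vector and one column of B.
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < K_SUB; ++k) {
        acc += d_vec[combos[c * K_SUB + k]] * B[k * N_COLS + j];
    }
    out[c * N_COLS + j] = acc;
}
```

I would launch it with something like `dim3 block(256); dim3 grid((N_COLS + block.x - 1) / block.x, batchSize);`, but the block size and the memory layout are exactly the kind of thing I am hoping to get guidance on.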
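For points 5 and 6, Thrust's sort_by_key and count_if look like a natural fit to me. The sketch below is how I imagine it for a single answer vector already resident on the device; the variable names and the resize-based filtering are my own assumptions, and the combination number is implicit (it is just the batch index), so I only carry the value and its original position.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/count.h>
#include <thrust/functional.h>

// Sketch of points 5 and 6 for one combination's 1 x 92378 answer vector.
// 'scores' holds the answers; 'origIdx' will hold their original positions.
void sort_and_filter(thrust::device_vector<float>& scores,
                     thrust::device_vector<int>&   origIdx,
                     float threshold)
{
    // Record the original index of each answer before sorting (0, 1, 2, ...).
    thrust::sequence(origIdx.begin(), origIdx.end());

    // Sort descending by value, carrying the original indices along.
    thrust::sort_by_key(scores.begin(), scores.end(), origIdx.begin(),
                        thrust::greater<float>());

    // After the descending sort the survivors sit at the front, so the
    // filter step is just finding the cut-off and shrinking both vectors.
    int keep = thrust::count_if(scores.begin(), scores.end(),
                                thrust::placeholders::_1 >= threshold);
    scores.resize(keep);
    origIdx.resize(keep);
}
```

Whether it would be better to filter before sorting (to shrink the sort) is another thing I am unsure about.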

Currently I am leaning towards a strided batched matrix multiply using cuBLAS (see the sketch below).
I am not looking for specific coded solutions (of course they are welcome), rather some guidance on how to architect a solution. I realise that my matrix dimensions do not fall on multiples of 16, but I believe cuBLAS deals with this well. I also know that I could attempt to code an optimised solution using streams, shared memory, etc. without cuBLAS, but I would forever be questioning whether my solution is truly optimal.
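
To show what I mean by the strided batched approach, here is roughly the cuBLAS call I had in mind. The buffer names, the column-major layout of the 10 x 92378 matrix, and the assumption that the 1 x 10 subsets have already been gathered into a contiguous batch buffer are all mine; I am also not sure how well GEMM behaves with m = 1, which is part of my question.

```cpp
#include <cublas_v2.h>

// Sketch only. Assumes:
//   d_subsets : batchSize gathered 1 x 10 subsets, packed contiguously (10 floats each)
//   d_B       : the 10 x 92378 constant matrix, stored column-major (ldb = 10)
//   d_out     : batchSize answer rows of 92378 floats each
// If I have understood the documentation correctly, strideB = 0 lets every
// batch entry reuse the same B matrix.
void multiply_batch(cublasHandle_t handle,
                    const float* d_subsets,
                    const float* d_B,
                    float*       d_out,
                    int          batchSize)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C (1 x 92378) = A (1 x 10) * B (10 x 92378), repeated batchSize times
    cublasSgemmStridedBatched(handle,
                              CUBLAS_OP_N, CUBLAS_OP_N,
                              /*m=*/1, /*n=*/92378, /*k=*/10,
                              &alpha,
                              d_subsets, /*lda=*/1,  /*strideA=*/10,
                              d_B,       /*ldb=*/10, /*strideB=*/0,
                              &beta,
                              d_out,     /*ldc=*/1,  /*strideC=*/92378,
                              batchSize);
}
```

If my arithmetic is right, each 1 x 92378 answer is about 92378 × 4 bytes ≈ 360 KB in single precision, so a batch of a few thousand combinations should fit comfortably in the 6 GB of global memory, with the 10,272,278,170 total combinations processed over many such batches.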

So, I would be most grateful for any guidance on a good approach to this task: optimal thread block sizes, strided approaches, batch processing, how to utilise memory optimally, and so on. For info, I have copied the deviceQuery results for my GPU below. Many thanks in advance.
If there is a more appropriate forum to ask this question, please let me know.

Device 0: “GeForce GTX 970M”
CUDA Driver Version / Runtime Version 11.0 / 10.2
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 6144 MBytes (6442450944 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1038 MHz (1.04 GHz)
Memory Clock rate: 2505 MHz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >