cuBLASDx batched gather GEMM

I’m trying to figure out how to write a batched GEMM that gathers columns of a source tensor by index. Effectively, it computes A^T A on the columns that each batch needs. The source tensor is column-major and padded, so the column copies from gmem to smem should be ideal (except for the index load). How can I represent the source tensor this way in cuBLASDx so that I can take advantage of the pipelining, vectorized loads, and/or cp.async that cuBLASDx does under the hood?

Hello,

I would like to suggest a solution, but I don’t understand your problem fully right now. Could you give an example or point me to a formal description?

Hello,

Thanks. This is pretty much a batched syrk, which would be [M,K,L] → [M,M,L]. The gather part is just that the inputs are instead [M,K] and [32,L] (indices, each in [0,K)), while the output is still [M,M,L]. The indices select 32 columns from [M,K] for each batched GEMM.

The tensors are going to be column-major and padded. So, apart from the indirection through the indices, each column should be loadable with coalesced accesses, cp.async, etc.
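To pin down the semantics, here is a minimal CPU reference in plain C++. It is only a sketch under assumptions: the function name `gather_syrk_reference`, the index layout `idx[g + 32*l]`, the tightly packed output, and reading the product as C_l = G_l G_l^T (with G_l the M×32 matrix of gathered columns, which is what gives an [M,M] result) are all illustrative choices, not something cuBLASDx prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of columns gathered per batch entry (32 in the description above).
constexpr int kGather = 32;

// CPU reference for the gathered, batched SYRK-like product (assumed layout):
//   src: column-major M x K matrix, leading dimension ld >= M (padding rows).
//   idx: 32 x L column indices, stored as idx[g + 32 * l], each in [0, K).
//   out: L column-major M x M matrices, tightly packed one after another.
std::vector<float> gather_syrk_reference(const std::vector<float>& src,
                                         const std::vector<int>& idx,
                                         int M, int K, int ld, int L) {
    std::vector<float> out(static_cast<std::size_t>(M) * M * L, 0.0f);
    for (int l = 0; l < L; ++l) {
        float* C = out.data() + static_cast<std::size_t>(l) * M * M;
        for (int g = 0; g < kGather; ++g) {
            const int col = idx[g + kGather * l];
            assert(col >= 0 && col < K);
            // Each gathered column is contiguous in memory: M floats at col * ld.
            const float* a = src.data() + static_cast<std::size_t>(col) * ld;
            // Rank-1 update: C += a * a^T.
            for (int j = 0; j < M; ++j)
                for (int i = 0; i < M; ++i)
                    C[i + j * M] += a[i] * a[j];
        }
    }
    return out;
}
```

A reference like this can be used to check the output of a device kernel on small inputs.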

I’m wondering whether cuBLASDx can express loading these columns while taking advantage of their contiguity, or whether I need something like CuTe C++ combined with cuBLASDx.

Thanks!

This use case should be possible with cuBLASDx, perhaps reaching into CuTe directly in a few places.

There are 2 parts here:

  1. Partitioning your tensors for the broadcast operation: cuBLASDx uses CuTe tensors as its data type, so you can use cute::local_partition, cute::local_tile, and friends without a problem. Some simple cases can be handled in cuBLASDx directly by constructing your tensors upfront with easy-to-slice layouts (these can be hierarchical) and then using cublasdx::slice to get the appropriate sub-tensor (e.g. getting a column out of a tensor is as easy as local_tensor(cublasdx::slice, col_id)). For more complicated strided partitions you need to use the CuTe layout algebra mentioned above.
  2. Copying tensors so that vectorization is fully utilized: for cooperative (more than 1 thread) copies of any tensor you can use cublasdx::copy, which will autovectorize and find all possible contiguous regions. It sounds to me like your use case should be covered.
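Whether a per-column copy can be vectorized depends on alignment, which the padding controls. The small host-side C++ check below is a sketch of the usual conditions for 16-byte accesses (for example float4 loads or 16-byte cp.async); the helper name `columns_are_16B_vectorizable` is hypothetical and it does not describe what cublasdx::copy does internally.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if every column of a column-major, padded matrix can be copied
// entirely with 16-byte vector accesses. Because the gather only selects
// which column to read, these conditions hold for any index in [0, K):
//   - the base pointer is 16-byte aligned,
//   - the column stride (ld elements) is a multiple of 16 bytes,
//   - the column length (rows elements) is a multiple of 16 bytes.
template <typename T>
bool columns_are_16B_vectorizable(const T* base, int rows, int ld) {
    constexpr std::size_t kVecBytes = 16;
    const bool base_ok =
        reinterpret_cast<std::uintptr_t>(base) % kVecBytes == 0;
    const bool stride_ok =
        (static_cast<std::size_t>(ld) * sizeof(T)) % kVecBytes == 0;
    const bool length_ok =
        (static_cast<std::size_t>(rows) * sizeof(T)) % kVecBytes == 0;
    return base_ok && stride_ok && length_ok;
}
```

For float data this means choosing the padded leading dimension as a multiple of 4 elements; the shared-memory destination needs the same alignment for 16-byte copies.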

If you have trouble expressing your problem with CuTe / cuBLASDx, you can share your code either here or through the CUDALibrarySamples GitHub issues.