cublasGemmBatchedEx requires the matrix pointers to be aligned in any of the following cases:
1. Atype is CUDA_R_16F or CUDA_R_16BF
2. computeType is any of the FAST options
3. algo enables fast math modes
The alignment must follow these rules:
1. if k % 8 == 0, then intptr_t(ptr) % 16 == 0
2. if k % 2 == 0, then intptr_t(ptr) % 4 == 0
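The two rules can be written as a small host-side check. This is only a sketch: `required_alignment` and `meets_alignment_rule` are hypothetical helper names, not part of cuBLAS, and they encode just the two rules quoted above (returning 1 when neither rule applies).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of cuBLAS): the pointer alignment in bytes
// that the two rules above ask for, given the GEMM inner dimension k.
// Returns 1 when neither rule applies (no extra requirement from these rules).
std::size_t required_alignment(int k) {
    assert(k > 0);
    if (k % 8 == 0) return 16;  // rule 1
    if (k % 2 == 0) return 4;   // rule 2
    return 1;
}

// Hypothetical helper: does `ptr` satisfy the rule for this k?
bool meets_alignment_rule(const void* ptr, int k) {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    return addr % required_alignment(k) == 0;
}
```

For example, with k = 64 a pointer at address 36 fails the check (36 % 16 == 4), while the same address passes for k = 6, where only 4-byte alignment is asked for.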
The problem I ran into is the following:
1. lhs_shape = [5,128,8,1,64], rhs_shape = [5,1,8,64,201]
2. Atype = float, computeType = CUBLAS_COMPUTE_32F_FAST_TF32
3. math mode = CUBLAS_TF32_TENSOR_OP_MATH
Thus the output shape is [5,128,8,1,201]. Each output matrix is 1 x 201 floats (804 bytes), and since k = 64 rule 1 applies, so most of the output matrix pointers do not meet the 16-byte alignment requirement.
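To make the misalignment concrete, the following host-side sketch computes the byte offset of each matrix in the broadcast batch. It assumes each tensor is a single contiguous, row-major float allocation whose base pointer is 16-byte aligned (cudaMalloc returns allocations aligned to at least 256 bytes). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shapes from the question (row-major, float32):
//   lhs [5,128,8,1,64], rhs [5,1,8,64,201], out [5,128,8,1,201]
// Batch dims broadcast to [5,128,8] -> 5 * 128 * 8 = 5120 GEMMs.
constexpr std::size_t kBatch = 5 * 128 * 8;
constexpr std::size_t kM = 1, kN = 201, kK = 64;
static_assert(kK % 8 == 0, "k = 64, so rule 1 (16-byte alignment) applies");

// Byte offset of the i-th lhs matrix (1 x 64 floats): i * 256.
std::size_t lhs_offset_bytes(std::size_t i) {
    assert(i < kBatch);
    return i * kM * kK * sizeof(float);
}

// Byte offset of the rhs matrix (64 x 201 floats) used by output batch i.
// rhs has size 1 in the 128 dim, so it is broadcast over that dim.
std::size_t rhs_offset_bytes(std::size_t i) {
    assert(i < kBatch);
    const std::size_t a = i / (128 * 8);  // index in the leading dim of 5
    const std::size_t c = i % 8;          // index in the dim of 8
    return (a * 8 + c) * kK * kN * sizeof(float);  // multiple of 51456
}

// Byte offset of the i-th output matrix (1 x 201 floats): i * 804.
std::size_t out_offset_bytes(std::size_t i) {
    assert(i < kBatch);
    return i * kM * kN * sizeof(float);
}

// How many output matrices start on a 16-byte boundary
// (relative to a 16-byte-aligned base pointer).
std::size_t count_16B_aligned_outputs() {
    std::size_t n = 0;
    for (std::size_t i = 0; i < kBatch; ++i)
        if (out_offset_bytes(i) % 16 == 0) ++n;
    return n;
}
```

Under these assumptions every lhs and rhs pointer lands on a 16-byte boundary, while an output pointer does so only when its batch index is a multiple of 4 (1280 of the 5120), because 804 % 16 == 4.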
However, cublasGemmBatchedEx runs successfully with cuBLAS 12.2.1.16, without any misaligned-address error.
Can anyone explain this?