cuBLAS: cublasGemmBatchedEx / cublasGemmStridedBatchedEx support for DP4A

Do the CUDA 10 versions of "cublasGemmBatchedEx" and "cublasGemmStridedBatchedEx" support DP4A instructions?

In the CUDA 10 documentation, CUDA_R_32I is not listed as a supported compute type for the batched/strided versions. This is in contrast to "cublasGemmEx" (i.e. the non-batched, non-strided version), which explicitly lists CUDA_R_32I as a supported compute type (i.e. 8-bit INT multiply with 32-bit INT accumulate, which allows use of DP4A instructions).
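
For concreteness, here is a minimal sketch of the non-batched int8 call that the docs do list as supported (not my real code; the sizes, handle setup and algo choice are just placeholders): A and B are CUDA_R_8I, while C and the compute type are CUDA_R_32I, so the DP4A path can be used.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Placeholder sizes; multiples of 4 to stay clear of the int8 alignment restrictions. */
    const int m = 128, n = 128, k = 128;

    int8_t  *dA, *dB;
    int32_t *dC;
    cudaMalloc((void **)&dA, sizeof(int8_t)  * m * k);
    cudaMalloc((void **)&dB, sizeof(int8_t)  * k * n);
    cudaMalloc((void **)&dC, sizeof(int32_t) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* With a CUDA_R_32I compute type, alpha and beta are 32-bit integers. */
    const int32_t alpha = 1, beta = 0;

    cublasStatus_t st = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        dA, CUDA_R_8I,  m,      /* lda */
        dB, CUDA_R_8I,  k,      /* ldb */
        &beta,
        dC, CUDA_R_32I, m,      /* ldc */
        CUDA_R_32I,             /* compute type (a cudaDataType_t in CUDA 10) */
        CUBLAS_GEMM_DEFAULT);

    printf("cublasGemmEx status: %d\n", (int)st);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```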

So is DP4A not supported for batched/strided GemmEx?
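
For reference, the call I would like to make looks roughly like the following (again only a sketch: the handle and the int8/int32 device buffers are assumed to be set up as in the snippet above, sized for batchCount matrices laid out back to back, and the algo choice is a placeholder). If the batched/strided path really does not accept CUDA_R_32I, I would expect this to come back as CUBLAS_STATUS_NOT_SUPPORTED rather than run on DP4A.

```c
#include <cublas_v2.h>
#include <stdint.h>

/* Wrapper around the strided-batched call in question. The handle and the
 * device buffers dA/dB/dC are assumed to hold batchCount matrices stored
 * contiguously, one after another. */
static cublasStatus_t try_int8_strided_batched(cublasHandle_t handle,
                                               const int8_t *dA, const int8_t *dB,
                                               int32_t *dC,
                                               int m, int n, int k, int batchCount)
{
    const int32_t alpha = 1, beta = 0;
    const long long strideA = (long long)m * k;
    const long long strideB = (long long)k * n;
    const long long strideC = (long long)m * n;

    return cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        dA, CUDA_R_8I,  m, strideA,
        dB, CUDA_R_8I,  k, strideB,
        &beta,
        dC, CUDA_R_32I, m, strideC,
        batchCount,
        CUDA_R_32I,             /* the compute type the batched/strided docs don't list */
        CUBLAS_GEMM_DEFAULT);
}
```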