Need help correcting cublasGemmEx parameters

Hi, I just started coding and I need some help with cublasGemmEx(). I don't really understand how the leading dimension works. Assuming I have these two arrays, A[r][c] and B[c][k], and I wish to multiply them as A x B, what would lda, ldb and ldc be in this case?

Also, here is the code that I was trying:

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cublas_v2.h>
#include <curand.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cstdlib>

using namespace std;

template<typename T> void printMatrix(int rowCount, int colCount, const T* matrix) {
	for (int i = 0; i < rowCount; i++) {
		for (int j = 0; j < colCount; j++) {
			cout << matrix[j * colCount + i] << "\t";
		}
		cout << endl;
	}
}

int main() {
	// Problem size
	int r =  4;
	int c =  3;

	// Declare pointers to matrices on device and host
	float* h_a, * h_b, * h_c;
	float* d_a, * d_b, * d_c;
	size_t bytesa = r * c * sizeof(float);
	size_t bytesb = c * r * sizeof(float);
	size_t bytesc = r * r * sizeof(float);

	// Allocate memory
	h_a = (float*)malloc(bytesa);
	h_b = (float*)malloc(bytesb);
	h_c = (float*)malloc(bytesc);
	cudaMalloc(&d_a, bytesa);
	cudaMalloc(&d_b, bytesb);
	cudaMalloc(&d_c, bytesc);

	// Pseudo random number generator
	curandGenerator_t prng;
	curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);

	// Set the seed
	curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long)clock());

	// Fill the matrix with random numbers on the device
	curandGenerateUniform(prng, d_a, r * c);
	curandGenerateUniform(prng, d_b, c * r);


	// cuBLAS handle
	cublasHandle_t handle;
	cublasCreate(&handle);

	// Scaling factors
	float alpha = 1.0f;
	float beta = 0.0f;

	cublasStatus_t status;
	status = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, r, c, r, &alpha, d_a,CUDA_R_16F, r, d_b,
 CUDA_R_16F, c, &beta, d_c,CUDA_R_16F, r, CUDA_R_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
	
	// Copy back the three matrices
	cudaMemcpy(h_a, d_a, bytesa, cudaMemcpyDeviceToHost);
	cudaMemcpy(h_b, d_b, bytesb, cudaMemcpyDeviceToHost);
	cudaMemcpy(h_c, d_c, bytesc, cudaMemcpyDeviceToHost);

	printMatrix(r, c, h_a);
	printMatrix(c, r, h_b);
	printMatrix(r, r, h_c);

	if (status != CUBLAS_STATUS_SUCCESS) {
		fprintf(stderr, "!!!! kernel execution error.\n");
		return EXIT_FAILURE;
	}
	// Verify solution
	//verify_solution(h_a, h_b, h_c, n);

	printf("COMPLETED SUCCESSFULLY\n");

	return 0;
}

But I get some errors:

** On entry to GEMM_EX  parameter number 12 had an illegal value
0.45563 0.444989        0.831396
0.796939        0.67204 0.182175
0.178881        0.830081        0.939539
0.444989        0.831396        0.703919
0.0115277       0.983714        0.824535        -4.22017e+37
0.241401        0.36433 0.539053        0
0.0816302       0.27666 0.749799        -3.77434e-28
0       0       0       0
0       0       0       0
0       0       0       0
0       0       0       0
!!!! kernel execution error.

I would greatly appreciate any help.

https://stackoverflow.com/questions/8206563/purpose-of-lda-argument-in-blas-dgemm

https://www.math.utah.edu/software/lapack/lapack-blas/dgemm.html
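
To make the linked explanations concrete: cuBLAS, like reference BLAS, assumes column-major storage, so element (i, j) of a matrix lives at offset i + j * ld, where ld (the leading dimension) is the number of elements between the starts of consecutive columns in memory. For a tightly packed matrix, ld is simply its number of rows. Below is a minimal CPU-only sketch of that indexing; colMajorIndex and makePacked are made-up names for illustration and are not part of cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuBLAS function): offset of element (row, col)
// in a column-major array whose columns start ld elements apart.
inline std::size_t colMajorIndex(int row, int col, int ld) {
    assert(row >= 0 && col >= 0 && ld > row);
    return static_cast<std::size_t>(col) * ld + row;
}

// Build a packed rows x cols column-major matrix holding 10*row + col at (row, col).
// "Packed" means the leading dimension equals the number of rows.
inline std::vector<float> makePacked(int rows, int cols) {
    std::vector<float> m(static_cast<std::size_t>(rows) * cols);
    for (int j = 0; j < cols; ++j) {
        for (int i = 0; i < rows; ++i) {
            m[colMajorIndex(i, j, rows)] = 10.0f * i + j;
        }
    }
    return m;
}
```

With r = 4 and c = 3, a packed A has ld = 4, so element (0, 1) sits at offset 4, not 3.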

Hmm, really sorry, but I still don’t get why the code is wrong. My A matrix is A[r][c] and my B matrix is B[c][r], so my m, n, k should be r, c, r respectively?
Then, from what I’m seeing, lda, ldb, ldc should be c, r, r? Or r, c, r?
Either way I still get a parameter number 9 or 12 error, so it could be some other problem.
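
The leading-dimension rules from the dgemm documentation linked above can be folded into one small check. The sketch below is plain C++ with made-up names (Op, minLda, minLdb, minLdc, leadingDimsOk); it only encodes the BLAS convention that op(A) is m x k, op(B) is k x n and C is m x n, all column-major. Here k means the GEMM argument k (the shared inner dimension), whatever variable the caller passes in that position.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for CUBLAS_OP_N / CUBLAS_OP_T; not the cuBLAS enum.
enum class Op { N, T };

// Minimum legal leading dimensions for C = op(A) * op(B).
inline int minLda(Op transa, int m, int k) { return std::max(1, transa == Op::N ? m : k); }
inline int minLdb(Op transb, int k, int n) { return std::max(1, transb == Op::N ? k : n); }
inline int minLdc(int m) { return std::max(1, m); }

// Returns true when all three leading dimensions are large enough.
inline bool leadingDimsOk(Op transa, Op transb, int m, int n, int k,
                          int lda, int ldb, int ldc) {
    assert(m >= 0 && n >= 0 && k >= 0);
    return lda >= minLda(transa, m, k) &&
           ldb >= minLdb(transb, k, n) &&
           ldc >= minLdc(m);
}
```

With the call as posted (m, n, k = 4, 3, 4 for r = 4, c = 3), k is 4, so ldb = 3 fails this check. If k were c instead (m, n, k = 4, 4, 3), the same ldb = 3 would pass.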

I’m really sorry, because I feel like I’m missing something really obvious here, but I’m still confused. According to the quote:

When TRANSB = 'N' or 'n' then LDB must be at least max( 1, k )

And since matrix B in my code uses CUBLAS_OP_N, TRANSB should be N? Which in this case means ldb should be max(1, k), i.e. c?
Only otherwise should it be n, as stated here:

otherwise  LDB must be at least  max( 1, n ).

Also, as per your recommendation, I changed ldb to r, and this is what I got, which I think is still wrong:

0.247245        0.547517        0.372891
0.436407        0.819102        0.841676
0.268685        0.141773        0.618597
0.547517        0.372891        0.582431
0.383606        0.959435        0.963584        -4.22017e+37
0.851904        0.646364        0.38808 0
0.148838        0.608849        0.492766        0.00146493
4.59177e-41     4.59177e-41     0       0
4.59163e-41     0       0       0
4.59177e-41     0       0       0
4.59163e-41     0       0       0
COMPLETED SUCCESSFULLY
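
One thing worth checking in these printouts: printMatrix in the posted code indexes with j * colCount + i, but for column-major data the stride between columns is the row count (the leading dimension), not the column count. That would explain why values repeat between rows of the 4 x 3 print, and why the 3 x 4 print shows garbage in its last column (it reads past the 12 allocated elements). A minimal corrected sketch follows, assuming packed column-major storage; formatMatrix is a hypothetical name, and it returns a string rather than printing so the result is easy to check.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical replacement for printMatrix: formats a rowCount x colCount
// column-major matrix whose columns are ld elements apart (ld >= rowCount).
template <typename T>
std::string formatMatrix(int rowCount, int colCount, const T* matrix, int ld) {
    assert(ld >= rowCount);
    std::ostringstream out;
    for (int i = 0; i < rowCount; ++i) {
        for (int j = 0; j < colCount; ++j) {
            out << matrix[j * ld + i] << "\t";  // stride between columns is ld
        }
        out << "\n";
    }
    return out.str();
}
```

For the matrices in the post this would be used as, for example, cout << formatMatrix(r, c, h_a, r); for A and cout << formatMatrix(c, r, h_b, c); for B.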

Thanks so much for all the replies, by the way. I really appreciate it.

Okay. I think I swapped the transpose operations in my last post. I’m going to remove it and redo the example when I’m back at my Linux machine.

Ok, thanks so much!

Just putting this out there: if A has shape [r,c] and B has shape [c,r] then parameters m,n,k should be r,r,c and not r,c,r as you have in the code. The inner dimension is passed last instead of being passed in order. If you’ve got the wrong dimensions for m,n,k then your leading dimensions will likely be considered wrong as well.
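
To make that concrete, here is a CPU-only reference sketch of what a GEMM with no transposes computes, using the same argument meaning as BLAS: A is m x k, B is k x n, C is m x n, all column-major. refGemm and multiplyPacked are hypothetical helpers for illustration, not cuBLAS calls. For A of shape [r, c] and B of shape [c, r], both stored packed, this gives m = r, n = r, k = c and lda = r, ldb = c, ldc = r.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for C = alpha * A * B + beta * C (no transposes).
// Column-major: A is m x k with leading dimension lda, B is k x n with ldb,
// C is m x n with ldc.
inline void refGemm(int m, int n, int k, float alpha,
                    const float* A, int lda, const float* B, int ldb,
                    float beta, float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[p * lda + i] * B[j * ldb + p];
            }
            // When beta is zero the old contents of C are ignored.
            float old = (beta == 0.0f) ? 0.0f : beta * C[j * ldc + i];
            C[j * ldc + i] = alpha * acc + old;
        }
    }
}

// A is r x c, B is c x r, both packed; the result is r x r.
inline std::vector<float> multiplyPacked(int r, int c,
                                         const std::vector<float>& A,
                                         const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(r) * r, 0.0f);
    refGemm(/*m=*/r, /*n=*/r, /*k=*/c, 1.0f, A.data(), /*lda=*/r,
            B.data(), /*ldb=*/c, 0.0f, C.data(), /*ldc=*/r);
    return C;
}
```

In the cublasGemmEx argument order (handle, transa, transb, m, n, k, alpha, A, Atype, lda, B, Btype, ldb, beta, C, Ctype, ldc, computeType, algo), the same choice reads r, r, c for the sizes and r, c, r for the leading dimensions. Separately, the posted code fills d_a and d_b with 32-bit floats via curandGenerateUniform but labels them CUDA_R_16F; assuming float data is what is intended, the three type arguments would presumably need to be CUDA_R_32F, with a matching 32-bit compute type for the cuBLAS version in use.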