Matrix multiplication with cublasSgemm

I have a newbie question. I'm trying to use cublasSgemm but fail to get the right result. I have tried to change the order of the matrices in the call, tried to transpose, and everything else I could think of. After trying different combinations for more than a day, I would really appreciate it if anyone could point me in the right direction.

I'm trying to multiply matrix A (1x3) with matrix B (3x4) and expect matrix C to be 1x4. As far as I understand, I should exchange A and B in the call because cublasSgemm uses the Fortran (column-major) matrix representation. So I end up with the following call:

const float alpha = 1.0f;
const float beta = 0.0f;
ret = cublasSgemm
(
	handle,
	CUBLAS_OP_N,       /* transa */
	CUBLAS_OP_N,       /* transb */
	MatrixA_height,    /* m */
	MatrixB_width,     /* n */
	MatrixA_width,     /* k */
	&alpha,
	d_MatrixB,         /* A */
	MatrixB_width,     /* lda */
	d_MatrixA,         /* B */
	MatrixA_width,     /* ldb */
	&beta,
	d_MatrixC,         /* C */
	MatrixA_width      /* ldc */
);

But I get the wrong result. As I said, I'm starting to give up, and soon I won't want to hear anyone talk about CUDA. :)
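In case it helps, this is the shape bookkeeping I think the trick implies (treat it as my assumption, since it may be exactly where I go wrong):

/* Row-major data reinterpreted as column-major:
 *   A (1x3 row-major, leading dim 3)  -> cuBLAS sees A^T (3x1)
 *   B (3x4 row-major, leading dim 4)  -> cuBLAS sees B^T (4x3)
 * So the product cuBLAS should form is
 *   B^T * A^T = (4x3) * (3x1) = C^T (4x1),
 * which read back as row-major is the 1x4 C I want.
 */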

Full program below:

*********************** FULL CODE *************************************

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int matrixMultiply()
{

int devID = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, devID);

//Matrix MatrixA
int MatrixA_height = 1;
int MatrixA_width = 3;

//Matrix MatrixB
int MatrixB_height = 3;
int MatrixB_width = 4;

int MatrixC_width = MatrixB_width;
int MatrixC_height = MatrixA_height;
   
// host data for matrix A (row-major)
unsigned int MatrixA_size = MatrixA_width * MatrixA_height;
unsigned int MatrixA_mem_size = sizeof(float) * MatrixA_size;
float h_MatrixA[3] = {0.5f, -0.5f, 1.0f};

// host data for matrix B (row-major)
unsigned int MatrixB_size = MatrixB_width * MatrixB_height;
unsigned int MatrixB_mem_size = sizeof(float) * MatrixB_size;
float h_MatrixB[12] = { -0.9f, -0.8f, -0.7f, -0.6f, -0.5f, -0.4f, -0.3f, -0.2f, -0.1f, 0.0f, 0.0f, 0.0f };

// host buffer for the result
float h_MatrixC[4];
unsigned int MatrixC_size = MatrixC_width * MatrixC_height;
unsigned int MatrixC_mem_size = sizeof(float) * MatrixC_size;

// allocate device memory
float *d_MatrixA, *d_MatrixB, *d_MatrixC;
cudaMalloc((void **) &d_MatrixA, MatrixA_mem_size);
cudaMalloc((void **) &d_MatrixB, MatrixB_mem_size);
cudaMemcpy(d_MatrixA, h_MatrixA, MatrixA_mem_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_MatrixB, h_MatrixB, MatrixB_mem_size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_MatrixC, MatrixC_mem_size);
cublasHandle_t handle;
cublasCreate(&handle);
cublasStatus_t ret;

// make the call to cublas
const float alpha = 1.0f;
const float beta  = 0.0f;
ret = cublasSgemm
(
	handle, 
	CUBLAS_OP_N, 
	CUBLAS_OP_N, 
	MatrixA_height, 
	MatrixB_width, 
	MatrixA_width, 
	&alpha, 
	d_MatrixB, 
	MatrixB_width, 
	d_MatrixA, 
	MatrixA_width, 
	&beta, 
	d_MatrixC, 
	MatrixA_width
);

if (ret != CUBLAS_STATUS_SUCCESS)
{
    printf("cublasSgemm returned error code %d, line(%d)\n", ret, __LINE__);
    return 1;
}

// copy result from device to host
cudaMemcpy(h_MatrixC, d_MatrixC, MatrixC_mem_size, cudaMemcpyDeviceToHost);

for (unsigned int i = 0; i < MatrixC_size; i++)
{
	printf("%u: %f\n", i, h_MatrixC[i]);
}

cublasDestroy(handle);
cudaFree(d_MatrixA);
cudaFree(d_MatrixB);
cudaFree(d_MatrixC);

return 0;

}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main()
{
matrixMultiply();
cudaDeviceReset();
getchar();
return 0;
}

Can you just modify the cuBLAS sample for your needs?
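For a row-major 1x3 times 3x4 product, the transpose trick means cuBLAS computes C^T = B^T * A^T, so m, n, and ldc have to come from B and C, not from A. Something like this (an untested sketch, using the variable names from your code) should work:

ret = cublasSgemm
(
	handle,
	CUBLAS_OP_N,
	CUBLAS_OP_N,
	MatrixB_width,     /* m = rows of B^T = 4 */
	MatrixA_height,    /* n = cols of A^T = 1 */
	MatrixA_width,     /* k = shared dimension = 3 */
	&alpha,
	d_MatrixB,         /* first operand: B^T in the column-major view */
	MatrixB_width,     /* lda = 4 */
	d_MatrixA,         /* second operand: A^T in the column-major view */
	MatrixA_width,     /* ldb = 3 */
	&beta,
	d_MatrixC,
	MatrixB_width      /* ldc = 4 */
);

Working it out by hand with your data, the result should be -0.30, -0.20, -0.20, -0.20.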

I got most of my code from the cuBLAS sample, but when I run the unmodified sample I get the output below. I fail to understand how you can multiply two 320 x 640 matrices and end up with another 320 x 640 matrix. Something seems broken, or am I making a fool of myself?

MatrixA(320,640), MatrixB(320,640), MatrixC(320,640)
Computing result using CUBLAS…done.
Performance= 427.02 GFlop/s, Time= 0.307 msec, Size= 131072000 Ops
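Just to spell out my confusion with a quick dimension check:

/* For C = A * B the inner dimensions must agree:
 *   (320 x 640) * (320 x 640)  -> inner dims 640 and 320 do not match
 *   (320 x 640) * (640 x 320)  =  (320 x 320), not (320 x 640)
 * So either the sample transposes an operand, or its printout is misleading.
 */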

But thanks for the answer! :)