Please help for using cublas Zgemm~

Hello all,

I have been struggling to use Zgemm for a week now. I use Zgemm to do several matrix multiplication operations. However, I can’t even get the first multiplication result correct. Here is what I want to do
AB = C. A is 2x54, and B is 54x2, and C is 2x2. From reading cublas library manual, AB = C, s0 C is m x n, A is m x k, and B is k x n. => m = 2, k = 54, n = 2.

Here is the example code that I wrote to try the first multiplication:


#include <stdio.h>
#include <stdlib.h>
#include
#include
#include
#include
#include <cuda.h>
#include <math.h>
#include <cuComplex.h>
#include <complex.h>
#include
#include
#include “device_launch_parameters.h”
#include <thrust/device_vector.h>
#include <thrust/system_error.h>
#include <thrust/extrema.h>
#include <cublas_v2.h>

using namespace std;

int main(void)
{
cuDoubleComplex array_host, debug_host;
cuDoubleComplex nAI, Tresultk;
cudaMalloc((void **)&Tresultk , 2
54
sizeof(cuDoubleComplex));
cudaMalloc((void **)&nAI, 2
54
sizeof(cuDoubleComplex));
cuDoubleComplex M1resultk;
cudaMalloc((void **)&M1resultk, 2
2*sizeof(cuDoubleComplex));

cuDoubleComplex test[108];
//test = (cuDoubleComplex *) malloc(2*54*sizeof(cuDoubleComplex));
//test[0].x = 0.234567; test[0].y = 1.234567;
/*
test[1].x = 0.234567; test[1].y = 1.234567;
test[2].x = 0.234567; test[2].y = 1.234567; test[3].x = 0.234567; test[3].y = 1.234567;
test[4].x = 0.234567; test[4].y = 1.234567;
 */
for (int z = 0; z<54; z++)
{
	test[z].x = z+1; test[z].y = 0;
	test[z+54].x = z+1; test[z+54].y = 0;
}
//test[54].x = 0.234567; test[54].y = 1.234567;
//test[1].x = 0.234567; test[1].y = 1.234567;
for (int zz = 0; zz<108; zz++)
{
	cout << test[zz].x << "  " << test[zz].y << '\n';
}
//cublasStatus_t statt;
cudaMemcpy(nAI,test,2*54*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
cudaError_t  code95 = cudaGetLastError();
if (code95 != cudaSuccess)
	printf ("95 cudamemcpy error -- %s\n", cudaGetErrorString(code95));
test[0].x = 1; test[0].y = 0;

for (int z = 1; z<108; z++)
{
	test[z].x = 0; test[z].y = 0;
}
test[1].x = 1; test[1].y =0;
for (int zz = 0; zz<108; zz++)
{
	cout << test[zz].x << "  " << test[zz].y << '\n';
}
cudaMemcpy(Tresultk,test,2*54*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
cudaError_t  code96 = cudaGetLastError();
if (code96 != cudaSuccess)
	printf ("96 cudamemcpy error -- %s\n", cudaGetErrorString(code96));


debug_host = (cuDoubleComplex *) malloc(2*54*sizeof(cuDoubleComplex));
cudaMemcpy(debug_host,nAI,2*54*sizeof(cuDoubleComplex),cudaMemcpyDeviceToHost);

cudaError_t  code92 = cudaGetLastError();
if (code92 != cudaSuccess)
	printf ("92 cudamemcpy error -- %s\n", cudaGetErrorString(code92));
ofstream myfile1;
myfile1.open ("nAI_07232015.txt");
for (int kl = 0; kl<108; kl++)
{
	myfile1 << "X is ";
	myfile1 << debug_host[kl].x << "  " << "Y is ";
	myfile1 << debug_host[kl].y << '\n';
}
myfile1.close();
//cublasSetMatrix (2 ,2 , sizeof (cuDoubleComplex) ,c ,2 ,M1test , 2 );
cuDoubleComplex al, bet;
al.x=1; al.y=0; bet.x=0; bet.y=0;
cublasHandle_t handle4;
cublasCreate_v2(&handle4);
cublasZgemm_v2(handle4,CUBLAS_OP_N,CUBLAS_OP_N,2,2,54,&al,nAI,2,Tresultk,54,&bet,M1resultk,2);

array_host = (cuDoubleComplex *) malloc(2*2*sizeof(cuDoubleComplex));
cudaMemcpy(array_host,M1resultk,2*2*sizeof(cuDoubleComplex),cudaMemcpyDeviceToHost);
cudaError_t  code93 = cudaGetLastError();
if (code93 != cudaSuccess)
	printf ("93 cudamemcpy error -- %s\n", cudaGetErrorString(code93));
ofstream myfile;
myfile.open ("M1resultk_07232015.txt");
for (int kk = 0; kk<4; kk++)
{
	myfile << "X is ";
	myfile << array_host[kk].x << "  " << "Y is ";
	myfile << array_host[kk].y << '\n';
}
myfile.close();
cublasDestroy_v2(handle4);
cout << "done with M1 k" << '\n';

}


I make the first row of A has data from 1 to 54, such as A[0] =1, A[1] =2, A[2] =3, … , A[53] =54, and second row of A has same data as first row, such as A[54] =1, A[55] =2, A[56] =3, … , A[107] =54.
I also make B[0] = 1 and B[1]= 1 and rest of elements are all 0.

No matter how I change m, n, k, I can’t get the correct result.

Can someone please help me? Thank you very much!!!

I forgot to post my system information:


Detected 1 CUDA Capable device(s)

Device 0: “Tesla K40c”
CUDA Driver Version / Runtime Version 7.0 / 7.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla K40c
Result = PASS


Hi beyond009,

Your m,n,k,lda,ldb, ldc seem to be correct to me. The leading dimension in cublas can be confusing too, you should check this out http://stackoverflow.com/questions/16376804/clarification-of-the-leading-dimension-in-cublas-when-transposing

I’m having the same kind of issues with Zgemm and I havn’t figured out the whole thing yet but I suspect your (and my ^^ ) problems are due to the fact BLAS works with column-major arrays.
Not 100% sure about what i’m going to say but your cuDoubleComplex* arrays are oriented “the same” way to his eyes if you say both are CUBLAS_OP_N.

If you find a solution please let me know, I’m trying to use cublasZgeam to get my matrix the “right” way but it doesn’t seem to work.

Hynd92

Perhaps you can start with a simpler example, like A(2,5)*B(5,2) = C(2,2) ?
Do you understand the difference between row-major storage and column-major storage? (Because you appear to be initializing things in row-major format, but CUBLAS expects column-major)

What is the actual matrix multiply that you expect? Something like this?:

A:           *  B:    =  C:

1 2 3 4 5       1  1     ?  ? 
1 2 3 4 5       0  0     ?  ?
                0  0
                0  0
                0  0

Because in the above example, I would expect a result of:

C:

1  1
1  1

If you fix the row-major/column-major setup issue, you can easily get that result.

Here’s a modified version of your code that produces that result:

$ cat t846.cu
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuComplex.h>
#include <cublas_v2.h>

#define AROW 2
#define ACOL 54
#define BROW ACOL
#define BCOL AROW
#define CROW AROW
#define CCOL BCOL

// A(2,54) * B(54,2) = C(2,2)

using namespace std;

int main(void)
{
  cuDoubleComplex *array_host, *debug_host;
  cuDoubleComplex *nAI, *Tresultk;
  cudaMalloc((void **)&Tresultk , AROW*ACOL*sizeof(cuDoubleComplex));
  cudaMalloc((void **)&nAI, AROW*ACOL*sizeof(cuDoubleComplex));
  cuDoubleComplex *M1resultk;
  cudaMalloc((void **)&M1resultk, CROW*CCOL*sizeof(cuDoubleComplex));

  cuDoubleComplex test[108];
// initialize for column-major storage
  for (int z = 0; z<ACOL; z++)
  {
    test[(AROW*z)].x = z+1; test[(AROW*z)].y = 0;
    test[(AROW*z)+1].x = z+1; test[(AROW*z)+1].y = 0;
  }
#ifdef DEBUG_PRINT
  for (int zz = 0; zz<108; zz++)
  {
    cout << test[zz].x << " " << test[zz].y << '\n';
  }
#endif
  cudaMemcpy(nAI,test,AROW*ACOL*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
  cudaError_t code95 = cudaGetLastError();
  if (code95 != cudaSuccess)
    printf ("95 cudamemcpy error -- %s\n", cudaGetErrorString(code95));
// initialize for column-major storage
  for (int z = 0; z<BROW*BCOL; z++)
  {
    test[z].x = 0; test[z].y = 0;
  }
  test[0].x = 1; test[0].y = 0;
  test[BROW].x = 1; test[BROW].y =0;
#ifdef DEBUG_PRINT
  for (int zz = 0; zz<108; zz++)
  {
    cout << test[zz].x << " " << test[zz].y << '\n';
  }
#endif
  cudaMemcpy(Tresultk,test,2*54*sizeof(cuDoubleComplex),cudaMemcpyHostToDevice);
  cudaError_t code96 = cudaGetLastError();
  if (code96 != cudaSuccess)
  printf ("96 cudamemcpy error -- %s\n", cudaGetErrorString(code96));

debug_host = (cuDoubleComplex *) malloc(2*54*sizeof(cuDoubleComplex));
  cudaMemcpy(debug_host,nAI,2*54*sizeof(cuDoubleComplex),cudaMemcpyDeviceToHost);

  cudaError_t code92 = cudaGetLastError();
  if (code92 != cudaSuccess)
  printf ("92 cudamemcpy error -- %s\n", cudaGetErrorString(code92));
#ifdef DEBUG_PRINT
  for (int kl = 0; kl<108; kl++)
  {
    std::cout << "X is ";
    std::cout << debug_host[kl].x << " " << "Y is ";
    std::cout << debug_host[kl].y << '\n';
  }
#endif
  cuDoubleComplex al, bet;
  al.x=1; al.y=0; bet.x=0; bet.y=0;
  cublasHandle_t handle4;
  cublasCreate_v2(&handle4);
  cublasStatus_t res = cublasZgemm_v2(handle4,CUBLAS_OP_N,CUBLAS_OP_N,2,2,54,&al,nAI,2,Tresultk,54,&bet,M1resultk,2);
  if (res != CUBLAS_STATUS_SUCCESS) {printf("oops! %d\n", (int)res);}
  array_host = (cuDoubleComplex *) malloc(2*2*sizeof(cuDoubleComplex));
  cudaMemcpy(array_host,M1resultk,2*2*sizeof(cuDoubleComplex),cudaMemcpyDeviceToHost);
  cudaError_t code93 = cudaGetLastError();
  if (code93 != cudaSuccess)
  printf ("93 cudamemcpy error -- %s\n", cudaGetErrorString(code93));
  for (int kk = 0; kk<4; kk++)
  {
    std::cout << "X is ";
    std::cout << array_host[kk].x << " " << "Y is ";
    std::cout << array_host[kk].y << '\n';
  }
  cublasDestroy_v2(handle4);
  cout << "done with M1 k" << '\n';

}
$ nvcc -o t846 t846.cu -lcublas
$ cuda-memcheck ./t846
========= CUDA-MEMCHECK
X is 1 Y is 0
X is 1 Y is 0
X is 1 Y is 0
X is 1 Y is 0
done with M1 k
========= ERROR SUMMARY: 0 errors
$

I understand it now. Thank you very much, txbob!!