CGEMM problems

Hello,

We are having some peculiar problems with CUBLAS’ CGEMM (CUDA 3.1 on a Tesla C1060).

The first thing to note is that for highly rectangular matrices (half a dozen rows by millions of columns), the CUBLAS implementation of CGEMM is slower than, say, MKL’s CBLAS. For instance, if A and B are 3 x 30,000,000 single-precision complex matrices, then CUBLAS takes about 16 times longer than MKL to compute A^H.B. It gets worse for 1 x 30,000,000, where MKL is about 150 times faster!
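For reference, the host-side comparison looks roughly like this (just a sketch, assuming the same column-major N x M storage and Y = X^H * X product as in the code below; any CBLAS, e.g. MKL’s, should do):

#include <cblas.h>

/* Sketch of the CPU reference: Y = X^H * X, with X an N x M single-precision
   complex matrix stored column-major as interleaved re/im floats. */
void reference_cgemm(int M, int N, const float *X, float *Y)
{
  const float alpha[2] = {1.0f, 0.0f};  /* 1 + 0i */
  const float beta[2]  = {0.0f, 0.0f};  /* 0 + 0i */

  cblas_cgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
              M, M, N,
              alpha, X, N,
              X, N,
              beta, Y, M);
}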

The second problem is that for certain matrix sizes, CUBLAS spits out an ‘execution failed’ error. For example, 3x33,333,333 gives this error, whereas 5x33,333,333 does not. Likewise, 30x3,333,333 fails, but 50x3,333,333 does not. Below I’ve attached my code for testing this; I’d appreciate it if someone could see whether they can reproduce it*. Strangely, this problem does not occur for real matrices (using SGEMM).

Thanks!

* You will probably require at least 1.5 GB of RAM free on both the host and your device. To compile, the following should do it:

nvcc main.cu -lcublas

Then when you run it, you will be prompted for M and then N. Please try M=3 and N=33333333 first and let me know what output you get…

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>
#include <cuda.h>

#define RE 1.0
#define IM 2.0

void checkCudaError();
void checkCublasError();

int main(void)
{
  cuComplex *hX, *hY;
  cuComplex *dX, *dY;
  cuComplex one, zero;
  one.x=1.0;  one.y=0.0;
  zero.x=0.0; zero.y=0.0;
  int i,M,N;

// Get problem size
  printf("M = ");
  scanf("%i",&M);
  printf("N = ");
  scanf("%i",&N);

// Allocate device memory
  cudaMalloc((void**)&dX,N*M*sizeof(cuComplex));
  cudaMalloc((void**)&dY,M*M*sizeof(cuComplex));
  checkCudaError();

// Allocate host memory
  hX=(cuComplex*)malloc(M*N*sizeof(cuComplex));
  hY=(cuComplex*)malloc(M*M*sizeof(cuComplex));

// Fill host memory
  for(i=0; i<M*N; i++)
  {
    hX[i].x=RE; hX[i].y=IM;
  }

// Copy host matrix to device
  cudaMemcpy(dX,hX,N*M*sizeof(cuComplex),cudaMemcpyHostToDevice);
  checkCudaError();

// Perform the multiplication: Y = X^H * X
  cublasCgemm('C','N', M, M, N, one, dX, N, dX, N, zero, dY, M);
  checkCublasError();

// Cleanup
  free(hX);
  free(hY);
  cudaFree(dX);
  cudaFree(dY);
  checkCudaError();

  printf("Passed!\n");
}

void checkCudaError()
{
  cudaError_t error = cudaGetLastError();
  if(error!=cudaSuccess) {
    printf("CUDA ERROR: %s\n", cudaGetErrorString(error) );
    exit(-1);
  }
}

void checkCublasError()
{
  cublasStatus cs=cublasGetError();
  if(cs==CUBLAS_STATUS_SUCCESS) return;
  switch(cs)
  {
  case CUBLAS_STATUS_NOT_INITIALIZED:
    printf("CUBLAS ERROR: Not initialised.\n");
    break;
  case CUBLAS_STATUS_ALLOC_FAILED:
    printf("CUBLAS ERROR: Alloc failed.\n");
    break;
  case CUBLAS_STATUS_INVALID_VALUE:
    printf("CUBLAS ERROR: Invalid value.\n");
    break;
  case CUBLAS_STATUS_ARCH_MISMATCH:
    printf("CUBLAS ERROR: Arch mismatch.\n");
    break;
  case CUBLAS_STATUS_MAPPING_ERROR:
    printf("CUBLAS ERROR: Mapping error.\n");
    break;
  case CUBLAS_STATUS_EXECUTION_FAILED:
    printf("CUBLAS ERROR: Execution failed.\n");
    break;
  case CUBLAS_STATUS_INTERNAL_ERROR:
    printf("CUBLAS ERROR: Internal error.\n");
    break;
  }
  exit(-2);
}

You should explicitly check the return value of every cudaMalloc. Some of the sizes you give cannot fit on a 4 GB board; the cuComplex type is 8 bytes. Also, the sizes M and N should be of type size_t (64-bit): otherwise the arithmetic computing the size of the matrix can be truncated.

This explains why some bigger matrices seem to work (even though they do not fit on the GPU) whereas smaller ones fail.
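Something along these lines (just a sketch; allocDeviceMatrix is an illustrative helper, not part of your program):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

/* Do the size arithmetic in size_t and fail loudly if the
   device allocation does not succeed. */
cuComplex* allocDeviceMatrix(size_t rows, size_t cols)
{
  cuComplex *d = NULL;
  size_t bytes = rows * cols * sizeof(cuComplex);   /* no 32-bit truncation */
  cudaError_t err = cudaMalloc((void**)&d, bytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc(%zu bytes) failed: %s\n",
            bytes, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
  return d;
}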

Thanks for the response, but that doesn’t appear to be the problem. I now define M and N as size_t, and am explicitly checking my cudaMalloc calls, but they return cudaSuccess each time…

That was clever of me. RDWD is another account of mine.

Note that it is 5 * (3.3 * 10^7) but 50 * (3.3 * 10^6), so all the matrices should fit: about 1.3 GB each = 2.6 GB total.

It’s 3 x 33,333,333 or 50 x 3,333,333 (you have one too many 3’s there). So:

50 * 3333333 * 8 / 2^30 = 1.24 GB per matrix.
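If it helps, here is a quick sanity check (a sketch using the runtime’s cudaMemGetInfo; the hard-coded sizes are just this contested case):

#include <stdio.h>
#include <cuda_runtime.h>

/* Print how much device memory the 50 x 3,333,333 matrix needs
   versus what the runtime says is actually free. */
int main(void)
{
  size_t freeBytes = 0, totalBytes = 0;
  size_t M = 50, N = 3333333;
  size_t needed = M * N * 8;   /* sizeof(cuComplex) == 8 bytes */

  cudaMemGetInfo(&freeBytes, &totalBytes);
  printf("need %.2f GB, free %.2f GB of %.2f GB\n",
         needed / 1073741824.0, freeBytes / 1073741824.0,
         totalBytes / 1073741824.0);
  return 0;
}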

Does anyone have a Tesla they could test my code on?

I have a Tesla and am using CUDA 3.2. I found a similarly strange problem with SGEMM and am trying to solve it.

Has your problem been solved? If so, please tell me the solution. Thank you very much.

I use CUDA 3.2 RC2 on a Tesla S2050 (4 GPUs + 12 GB GDDR5 ECC in total = 2.625 GB per GPU, i.e. -12.5% because of ECC).

Test 1: M=3,  N=33333333 = passed
Test 2: M=5,  N=33333333 = out of memory
Test 3: M=30, N=3333333  = passed
Test 4: M=50, N=3333333  = out of memory

I agree with philippev: there is not enough memory for the bigger matrices.