CUDA 2.0 beta seems to fail after long executions with multiple processes on one card

Hi,

I am developing a hybrid CPU/GPU scheduler and I am having some problems with an NVIDIA card running the CUDA 2.0 beta.

I have several CUDA applications and I launch different sets of them (around 4 processes each time) many times. Executing the whole set of combinations for the different applications takes some hours.

The problem is that after executing for some time, the card/driver seems to start behaving strangely. It is difficult to post code that reproduces the error, because it appears randomly.

As far as I know, when a CUDA process finishes, all the GPU memory it used is freed (even if cudaFree() is not called). But after running many processes, eventually the next execution fails with different errors (depending on which application fails), and it looks like a lack of GPU memory. In fact, if I try to allocate just 1 byte on the card, it says “out of memory”.
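For reference, a minimal standalone probe along these lines (a sketch using the CUDA runtime API, not the exact code I run) is enough to show the problem once the card is in this state:

// Minimal 1-byte allocation probe (sketch; error text comes from cudaGetErrorString)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  void* p = 0;
  cudaError_t err = cudaMalloc(&p, 1);   // try to allocate a single byte on the card
  if (err != cudaSuccess) {
    printf("cudaMalloc of 1 byte failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("1-byte allocation succeeded\n");
  cudaFree(p);
  return 0;
}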

Just to give some examples of the errors I get, here is a short list:

CUDA error: out of memory
CUBLAS: Library has not been initialized. ==> but the call to cublasInit() is indeed in the code
CUBLAS: Object could not be allocated due to lack of resources.
ERROR: CUFFT_EXEC_FAILED
CUDA error: the launch timed out and was terminated
CUDA error: unspecified launch failure

So, as you can see, I am using a mix of plain CUDA, CUFFT and CUBLAS applications. The errors start appearing in most of the executions after some hours of running. Then everything starts to fail.

If you have any insight into why this is happening, I would be very glad!

Thanks,
Victor

I don’t see anything attached to your post. If you’d like further assistance, please attach a test app which reproduces this problem, along with an nvidia-bug-report.log.

Hi,

I am sorry, I had not heard about the nvidia bug report log before.

So here is some code that eventually fails. I also attach the bug report log.

// NxN matrix multiply on the GPU using the legacy CUBLAS API (CUDA 2.0 beta)

#include <iostream>
#include <cstdlib>
#include <cublas.h>

using std::cerr;
using std::endl;

void chk_cublas(cublasStatus status);   // defined below

void kernel_matmul_gpu(const float* A, const float* B, float* C, int N)
{
  float alpha = 1.0f;
  float beta  = 0.0f;
  int M = N;
  int K = N;
  int lda = N;
  int ldb = N;
  int ldc = N;
  float* d_A;
  float* d_B;
  float* d_C;

  // Allocate the device matrices
  chk_cublas(cublasAlloc(M*K, sizeof(float), (void**)&d_A));
  chk_cublas(cublasAlloc(K*N, sizeof(float), (void**)&d_B));
  chk_cublas(cublasAlloc(M*N, sizeof(float), (void**)&d_C));

  // Copy the input matrices to the device
  chk_cublas(cublasSetMatrix(M, K, sizeof(float), A, lda, d_A, lda));
  chk_cublas(cublasSetMatrix(K, N, sizeof(float), B, ldb, d_B, ldb));

  /* perform C := alpha*op(A)*op(B) + beta*C */
  cublasSgemm('N', 'N', M, N, K, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
  chk_cublas(cublasGetError());

  // Copy the result back and release the device memory
  chk_cublas(cublasGetMatrix(M, N, sizeof(float), d_C, ldc, C, ldc));
  chk_cublas(cublasFree(d_A));
  chk_cublas(cublasFree(d_B));
  chk_cublas(cublasFree(d_C));
}

void chk_cublas(cublasStatus status)
{
  if (status == CUBLAS_STATUS_SUCCESS)
    return;

  switch (status) {
    case CUBLAS_STATUS_NOT_INITIALIZED:
      cerr << "CUBLAS: Library has not been initialized." << endl;
      break;
    case CUBLAS_STATUS_ALLOC_FAILED:
      cerr << "CUBLAS: Object could not be allocated due to lack of resources." << endl;
      break;
    case CUBLAS_STATUS_INVALID_VALUE:
      cerr << "CUBLAS: Invalid value." << endl;
      break;
    case CUBLAS_STATUS_MAPPING_ERROR:
      cerr << "CUBLAS: An error occurred accessing GPU memory." << endl;
      break;
    case CUBLAS_STATUS_EXECUTION_FAILED:
      cerr << "CUBLAS: Function failed to launch on GPU." << endl;
      break;
    case CUBLAS_STATUS_INTERNAL_ERROR:
      cerr << "CUBLAS: Object could not be deallocated." << endl;
      break;
  }

  exit(-1);
}
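For completeness, this is roughly how the function above could be driven in each process, with cublasInit()/cublasShutdown() around it (a simplified sketch; the size and data below are placeholders, not my real workload):

// Simplified driver sketch (assumes kernel_matmul_gpu and chk_cublas above are in the same file)
#include <vector>

int main()
{
  chk_cublas(cublasInit());            // initialize CUBLAS once at startup

  const int N = 1024;                  // placeholder size
  std::vector<float> A(N*N, 1.0f);
  std::vector<float> B(N*N, 2.0f);
  std::vector<float> C(N*N, 0.0f);

  kernel_matmul_gpu(&A[0], &B[0], &C[0], N);

  chk_cublas(cublasShutdown());        // release CUBLAS resources before exiting
  return 0;
}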

Just tell me if something else is needed.

Thanks,

Victor

The file was not attached to the previous post. I needed to change the .log extension, so I added .txt at the end.

Victor
nvidia_bug_report.log.txt (113 KB)

This time the machine simply rebooted at some point during the execution, and looking at the system logs I found the following information. It seems related to the problem and I hope it can help:

Thanks,

Victor

Today my machine became unresponsive after executing some CUDA programs in parallel. I was able to obtain this trace:

I hope it can help.

BTW, any clue on the possible error?

Víctor