CUDA 2.0 beta seems to fail after long executions with multiple processes on one card

Hi,

I am developing a hybrid CPU/GPU scheduler and I am having some problems with an NVIDIA card running the CUDA 2.0 beta.

I have several CUDA applications and I launch different sets of them (around 4 processes each time) many times. Executing the whole set of combinations for the different applications takes some hours.

The problem is that after executing for some time, the card/driver seems to start behaving strangely. It is difficult to post code that reproduces the error, because it appears randomly.

As far as I know, when a CUDA process finishes, all the GPU memory it used is freed (even if cudaFree() is not called). But after running many processes, eventually the next execution fails with different errors (depending on which application fails), and it looks like a lack of GPU memory. In fact, if I try to allocate just 1 byte on the card, it says “out of memory”.
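For reference, a minimal standalone probe along these lines (a sketch using the CUDA runtime API, not the exact code I run) is enough to show the problem once the card is in this state:

// Minimal 1-byte allocation probe (sketch; error text comes from cudaGetErrorString)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  void* p = 0;
  cudaError_t err = cudaMalloc(&p, 1);   // try to allocate a single byte on the card
  if (err != cudaSuccess) {
    printf("cudaMalloc of 1 byte failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("1-byte allocation succeeded\n");
  cudaFree(p);
  return 0;
}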

Just to give some examples of the errors I get, here is a short list:

CUDA error: out of memory
CUBLAS: Library has not been initialized. ==> but the call to cublasInit() is indeed in the code
CUBLAS: Object could not be allocated due to lack of resources.
ERROR: CUFFT_EXEC_FAILED
CUDA error: the launch timed out and was terminated
CUDA error: unspecified launch failure

So, as you can see, I am using a mix of plain CUDA, CUFFT and CUBLAS applications. The errors start appearing in most of the executions after some hours of running. Then everything starts to fail.

If you have any insight into why this is happening, I would be very glad!

Thanks,
Victor

I don’t see anything attached to your post. If you’d like further assistance, please attach a test app which reproduces this problem, along with an nvidia-bug-report.log.

Hi,

I am sorry, I had not heard about the nvidia bug report log before.

So here is some code that eventually fails. I also attach the bug report log.

// NxN matrix multiply on the GPU using the legacy CUBLAS API (CUDA 2.0 beta)

#include <iostream>
#include <cstdlib>
#include <cublas.h>

using std::cerr;
using std::endl;

void chk_cublas(cublasStatus status);   // defined below

void kernel_matmul_gpu(const float* A, const float* B, float* C, int N)
{
  float alpha = 1.0f;
  float beta  = 0.0f;
  int M = N;
  int K = N;
  int lda = N;
  int ldb = N;
  int ldc = N;
  float* d_A;
  float* d_B;
  float* d_C;

  // Allocate the device matrices
  chk_cublas(cublasAlloc(M*K, sizeof(float), (void**)&d_A));
  chk_cublas(cublasAlloc(K*N, sizeof(float), (void**)&d_B));
  chk_cublas(cublasAlloc(M*N, sizeof(float), (void**)&d_C));

  // Copy the input matrices to the device
  chk_cublas(cublasSetMatrix(M, K, sizeof(float), A, lda, d_A, lda));
  chk_cublas(cublasSetMatrix(K, N, sizeof(float), B, ldb, d_B, ldb));

  /* perform C := alpha*op(A)*op(B) + beta*C */
  cublasSgemm('N', 'N', M, N, K, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
  chk_cublas(cublasGetError());

  // Copy the result back and release the device memory
  chk_cublas(cublasGetMatrix(M, N, sizeof(float), d_C, ldc, C, ldc));
  chk_cublas(cublasFree(d_A));
  chk_cublas(cublasFree(d_B));
  chk_cublas(cublasFree(d_C));
}

void chk_cublas(cublasStatus status)
{
  if (status == CUBLAS_STATUS_SUCCESS)
    return;

  switch (status) {
    case CUBLAS_STATUS_NOT_INITIALIZED:
      cerr << "CUBLAS: Library has not been initialized." << endl;
      break;
    case CUBLAS_STATUS_ALLOC_FAILED:
      cerr << "CUBLAS: Object could not be allocated due to lack of resources." << endl;
      break;
    case CUBLAS_STATUS_INVALID_VALUE:
      cerr << "CUBLAS: Invalid value." << endl;
      break;
    case CUBLAS_STATUS_MAPPING_ERROR:
      cerr << "CUBLAS: An error occurred accessing GPU memory." << endl;
      break;
    case CUBLAS_STATUS_EXECUTION_FAILED:
      cerr << "CUBLAS: Function failed to launch on GPU." << endl;
      break;
    case CUBLAS_STATUS_INTERNAL_ERROR:
      cerr << "CUBLAS: Object could not be deallocated." << endl;
      break;
  }

  exit(-1);
}
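For completeness, this is roughly how the function above could be driven in each process, with cublasInit()/cublasShutdown() around it (a simplified sketch; the size and data below are placeholders, not my real workload):

// Simplified driver sketch (assumes kernel_matmul_gpu and chk_cublas above are in the same file)
#include <vector>

int main()
{
  chk_cublas(cublasInit());            // initialize CUBLAS once at startup

  const int N = 1024;                  // placeholder size
  std::vector<float> A(N*N, 1.0f);
  std::vector<float> B(N*N, 2.0f);
  std::vector<float> C(N*N, 0.0f);

  kernel_matmul_gpu(&A[0], &B[0], &C[0], N);

  chk_cublas(cublasShutdown());        // release CUBLAS resources before exiting
  return 0;
}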

Just tell me if something else is needed.

Thanks,

Victor

The file was not attached to the previous post. I needed to change the .log extension, so I added .txt at the end.

Victor
nvidia_bug_report.log.txt (113 KB)

This time the machine simply rebooted at some point during the execution, and looking at the system logs I found the following information. It seems related to the problem and I hope it can help:

Thanks,

Victor

Today my machine became unresponsive after executing some CUDA programs in parallel. I was able to obtain this trace:

I hope it can help.

BTW, any clue on the possible error?

Víctor