CUDA unknown error (error code 30)

Hi,

I get an unknown error (code 30) after executing thousands of kernels.

I put traces in the code and saw that the error appears when I call ‘cudaMalloc’, but I don’t understand why, because I called ‘cudaMemGetInfo’ and all the memory was free.

I am also running my simulations under ‘cuda-memcheck’ and it reports no errors. Finally, this error crashes the device and I have to reset the workstation.

Has anyone seen the same or a similar error?

Thank you.

If I run the simulations with a small number of kernel executions, they finish.

The following trace is from a run under ‘cuda-memcheck’.

========= CUDA-MEMCHECK
----------------------------------------------------------
| SOFTWARE 0.5 -INIT                                     |
----------------------------------------------------------
Disabling prefiltering 10 1 100
INPUT file: tk_set_cf_bcc_p1000.mol2
Reference INPUT file: tk_query_bcc.mol2
Reference OUTPUT file: ./20161213164052_13832_test/tk_query_bcc_out.mol2

Reading reference molecule 1
OUTPUT file: ./20161213164052_13832_test/tk_set_cf_bcc_p1000_ref1_out

Reading reference molecule 2
OUTPUT file: ./20161213164052_13832_test/tk_set_cf_bcc_p1000_ref2_out
----------------------------------------------------------
| SOFTWARE 0.5 - FINISHED                             |
----------------------------------------------------------
========= Internal error (7)
========= No CUDA-MEMCHECK results found

cuda-memcheck doesn’t report any errors, but I see an ‘Internal error (7)’. What does it mean?

An “internal error” means that something unexpected happened inside the program (here: cuda-memcheck) and that there is no public information about it. Mostly these are debugging hints for the developers.

Internal errors should not occur, and any instances encountered are usually worth reporting via the normal bug-reporting channels. You will need to submit a repro case with the bug report, though, which may be non-trivial in a case like this.

Yes,

It is a strange error, and at the moment I don’t know how to solve it. Stranger still, the error doesn’t happen at the same point each time; it is random.
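
One possible explanation, not confirmed anywhere in this thread: kernel launches are asynchronous, and a runtime call such as ‘cudaMalloc’ can return an error left behind by an earlier kernel. That would fit both the failure showing up at ‘cudaMalloc’ with all memory free and the failure point moving from run to run. A minimal sketch for narrowing it down is to check for errors and synchronize after every launch while debugging. The names ‘cudaOk’, ‘CUDA_CHECK’ and ‘stepKernel’ are made up for illustration and are not from the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a message and return false unless err is cudaSuccess.
static bool cudaOk(cudaError_t err, const char *what, const char *file, int line) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s:%d: %s failed: %s (code %d)\n", file, line, what,
            cudaGetErrorString(err), static_cast<int>(err));
    return false;
}

// Abort at the first failing call so the reported location is meaningful.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        if (!cudaOk((call), #call, __FILE__, __LINE__))       \
            exit(EXIT_FAILURE);                               \
    } while (0)

// Placeholder kernel standing in for the real simulation kernels.
__global__ void stepKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));

    for (int iter = 0; iter < 1000; ++iter) {
        stepKernel<<<(n + 255) / 256, 256>>>(d, n);
        CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
        CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    }

    CUDA_CHECK(cudaFree(d));
    printf("all launches completed\n");
    return 0;
}
```

Synchronizing after every launch slows the run down, so this is meant as a debugging aid only. If the error then shows up right after one particular launch, that kernel is the place to look.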

Hi,

I read this topic https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

Could my problem be related to this?

I see that it is about Windows, but could the same problem occur on Linux?

I’m using Ubuntu 16.04 on my workstation.

Thank you.
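
On the Linux question: the CUDA runtime can report whether a run-time limit applies to kernels on a given device, through the ‘cudaDevAttrKernelExecTimeout’ attribute. The limit is typically active when the GPU also drives a display. The sketch below only prints that attribute for each device. Whether a timeout is actually the cause here is an assumption that the thread does not confirm, and ‘watchdogLabel’ is a made-up helper name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the attribute value into a readable label.
static const char *watchdogLabel(int timeoutEnabled) {
    return timeoutEnabled ? "enabled" : "disabled";
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        int timeoutEnabled = 0;
        // Nonzero means kernels on this device are subject to a run-time limit.
        err = cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("device %d: kernel run-time limit %s\n", dev, watchdogLabel(timeoutEnabled));
    }
    return 0;
}
```

If the limit is reported as enabled on the device running the simulation, a single long-running kernel becomes a candidate cause worth ruling out.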

Hi all

I encountered the same problem on Linux as well, even with a hello-world CUDA program.

georgeliao@dw064:~/software_projects/test_code/cuda_test$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
Hello World!
========= Internal error (7)
========= No CUDA-MEMCHECK results found

Does anybody have any ideas? Could it be a bug in cuda-memcheck?

Thanks

Best regards
George Liao