CUDA error: the launch timed out and was terminated (training on two GPUs)

CUDA:1 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:2 (GeForce GTX TITAN X, 12212.8125MB)

driver: 455.38
docker: nvcr.io/nvidia/pytorch:21.03-py3

I am training with the two GPUs, but training fails with the error below.
Can the two GPUs be used at the same time?
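For reference, a minimal sketch to check that each device can run a kernel at all (an assumption on my part: the device indices 1 and 2 are taken from the CUDA:1 / CUDA:2 listing above and may need adjusting):

```python
import torch

# Queue a matmul on each device, then synchronize; if a watchdog or a
# hardware fault is the problem, this should reproduce the launch error.
for idx in (1, 2):  # indices taken from the CUDA:1 / CUDA:2 listing above
    dev = torch.device(f"cuda:{idx}")
    x = torch.randn(1024, 1024, device=dev)
    y = x @ x
    torch.cuda.synchronize(dev)
    print(idx, torch.cuda.get_device_name(idx), y.norm().item())
```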

Traceback (most recent call last):
  File "train.py", line 541, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 304, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at …/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd165a3e5cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd165a04d4e in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x987 (0x7fd165a7f6f7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x5c (0x7fd165a244cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x29a (0x7fd1b2b3bd7a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x1c4 (0x7fd1b2b31444 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7fd1b2b642c6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x22 (0x7fd1b2b697f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xc700e5 (0x7fd1b2b680e5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6ff782 (0x7fd1b25f7782 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x700743 (0x7fd1b25f8743 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x12b785 (0x5565291cb785 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1ca984 (0x55652926a984 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x11f906 (0x5565291bf906 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x12bc96 (0x5565291cbc96 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x12bc4c (0x5565291cbc4c in /opt/conda/bin/python)
frame #18: <unknown function> + 0x154ec8 (0x5565291f4ec8 in /opt/conda/bin/python)
frame #19: PyDict_SetItemString + 0x87 (0x5565291f6127 in /opt/conda/bin/python)
frame #20: PyImport_Cleanup + 0x9a (0x5565292f65aa in /opt/conda/bin/python)
frame #21: Py_FinalizeEx + 0x7d (0x5565292f694d in /opt/conda/bin/python)
frame #22: Py_RunMain + 0x110 (0x5565292f77f0 in /opt/conda/bin/python)
frame #23: Py_BytesMain + 0x39 (0x5565292f7979 in /opt/conda/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fd1e12bf0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e7185 (0x556529287185 in /opt/conda/bin/python)

https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x
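That article describes how a GPU that is also driving an X display is subject to a watchdog that terminates kernels running longer than a few seconds, which would match the "launch timed out" message above. A minimal sketch to check which GPU has a display attached, assuming the nvidia-ml-py (pynvml) bindings are installed (they may not be in the container by default):

```python
import pynvml  # assumption: installed via `pip install nvidia-ml-py3`

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    # A GPU with an active display is subject to the watchdog timer,
    # which kills long-running kernels such as large training steps.
    active = pynvml.nvmlDeviceGetDisplayActive(handle)
    print(i, name, "display active:", bool(active))
pynvml.nvmlShutdown()
```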

This error occurs every time I train.