FRCNN training failed on Linux, but ran without error on Windows. Why?


I’m training the model in
When I ran it on Ubuntu, it failed with errors.

Excerpt from a run under cuda-memcheck:

========= Host Frame:/usr/lib/x86_64-linux-gnu/ [0xbd9e0]
========= Host Frame:/lib/x86_64-linux-gnu/ [0x76db]
========= Host Frame:/lib/x86_64-linux-gnu/ (clone + 0x3f) [0x12188f]

2019-05-09 06:23:23.615388: E tensorflow/stream_executor/cuda/] failed to enqueue async memcpy from host to device: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure; GPU dst: 0x7f32ceb0e800; host src: 0x248a6bc0; size: 5760000=0x57e400
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to “unspecified launch failure” on CUDA API call to cuEventQuery.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/ (cuEventQuery + 0x143) [0x2524a3]
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/…/ (_ZN15stream_executor4cuda10CUDADriver10QueryEventEPNS0_11CudaContextEP10CUevent_st + 0x2b) [0xbe6c7b]
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/…/ (_ZN15stream_executor4cuda9CUDAEvent13PollForStatusEv + 0x32) [0xbeff02]
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/ (_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE + 0xa1) [0x77ed4b1]
2019-05-09 06:23:23.616785: E tensorflow/stream_executor/cuda/] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/ (_ZN10tensorflow8EventMgr8PollLoopEv + 0xce) [0x77ed9fe]
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/…/ (_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi + 0x306) [0x794dc6]
========= Host Frame:/home/masatoshi/frcnn-from-scratch-with-keras/venv/lib/python3.6/site-packages/tensorflow/python/…/ (_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data + 0x44) [0x793c84]
========= Host Frame:/usr/lib/x86_64-linux-gnu/ [0xbd9e0]
2019-05-09 06:23:23.616814: F tensorflow/core/common_runtime/gpu/] Unexpected Event status: 1
========= Host Frame:/lib/x86_64-linux-gnu/ [0x76db]
========= Host Frame:/lib/x86_64-linux-gnu/ (clone + 0x3f) [0x12188f]

========= Error: process didn’t terminate successfully
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x200612000

========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x200608000

========= No CUDA-MEMCHECK results found
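
When a log like the excerpt above runs to many lines, it can help to reduce it to the distinct CUDA error names it contains, for example to compare runs on two machines. The following is only a minimal sketch using the Python standard library; `summarize_cuda_errors` and `sample_log` are hypothetical names, and the sample input is two messages adapted from the excerpt, not a full log.

```python
import re
from collections import Counter

# Matches driver error names such as CUDA_ERROR_LAUNCH_FAILED.
CUDA_ERROR_RE = re.compile(r"CUDA_ERROR_[A-Z_]+")


def summarize_cuda_errors(log_text):
    """Count each distinct CUDA_ERROR_* name found in the log text."""
    return Counter(CUDA_ERROR_RE.findall(log_text))


# Two messages adapted from the excerpt above, used only as sample input.
sample_log = (
    "failed to enqueue async memcpy from host to device: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure\n"
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure\n"
)

summary = summarize_cuda_errors(sample_log)
for name, count in summary.most_common():
    print(name, count)
```

In practice the text would be read from a saved log file rather than from an inline string.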

Ubuntu environment:
Ubuntu Desktop 18.04.2 LTS
nvidia-driver-418/bionic,now 418.56-0ubuntu0~gpu18.04.1 amd64
nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1 amd64
libcudnn7/now amd64
Python 3.6.7
TensorFlow 1.13.1
Keras 2.2.4

But when I ran the same model on Windows 10, it ran without errors.

Windows Environment:
Windows 10 10.0.17763 build 17763
Python 3.6.8
TensorFlow 1.13.1
Keras 2.2.4
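
To compare the two machines line by line, the Python-side details listed above can be collected with a short script. This is a rough sketch, not a complete diagnostic: `collect_env` is a hypothetical helper, it assumes packages expose a `__version__` attribute, and it does not cover the NVIDIA driver, CUDA toolkit, or cuDNN versions.

```python
import importlib
import platform


def collect_env(packages=("tensorflow", "keras")):
    """Return OS, Python, and package versions as a dict of strings."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Fall back to "unknown" if the package has no __version__.
            info[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            info[name] = "not installed"
    return info


env = collect_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

Running the same script on both systems gives two outputs that can be compared directly.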

Self Resolution:

After a clean installation of the NVIDIA driver on a fresh Ubuntu 18.04, nvidia-docker run no longer causes any errors.

On Windows 10, an error that had been reported by the GPU Memory Test program also disappeared after a fresh installation of the OS and the driver.


Thanks for sharing. A quick check: you mentioned there was an error on Windows 10. Was it the same as the one reported on Ubuntu (the launch failure error)?