Our application uses Go and Cgo to call CUDA kernels. We have around 6 different kernels, each of which is called fairly regularly (on the order of tens to hundreds of times per second). There is an FFT, some TensorFlow, and some other simple maths/memory operation kernels. Most are executed on their own stream with a streamSync() after launching. A few run on the default stream and have a deviceSync() after launching.
The app can run fine for days, but occasionally it stops, and the culprit seems to be one of the kernels. The Go stack traces indicate that all the current Cgo calls are waiting on a syscall.
When the issue happens, GPU usage seems to get pegged at 55% and the following errors appear in dmesg:
[11653.388292] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR] Channel 504 to be removed from TSG 4 has NEXT set!
[11653.388551] nvgpu: 17000000.gv11b gk20a_tsg_unbind_channel:164 [ERR] Channel 504 unbind failed, tearing down TSG 4
[11653.390138] nvgpu: 17000000.gv11b gk20a_fifo_tsg_unbind_channel_verify_status:2200 [ERR] Channel 506 to be removed from TSG 3 has NEXT set!
[11653.390391] nvgpu: 17000000.gv11b gk20a_tsg_unbind_channel:164 [ERR] Channel 506 unbind failed, tearing down TSG 3
I have looked at the driver where these messages occur, but I don’t know what a graphics channel is or what binding/unbinding means.
I am working on a test case to try to replicate the issue so I can profile it.
Can anyone suggest anything that might cause this issue?