USB data packet corruption when using GPU

We have a workstation with an NVIDIA Quadro RTX 5000.

We use a USB device to stream data from an external unit. The streamed data contains a CRC for data integrity. If the system is otherwise idle, streaming data works without any issue.

We have observed that certain CUDA operations running concurrently with the data streaming can corrupt the streamed data (detected by the CRC). Specifically, we can cause corruption by running `import tensorflow` in a loop of Python processes, i.e. each import, and hence each TensorFlow initialization, happens in a new Python process. We can also cause corruption (less frequently) by running CUDA operations through `tf.function` code in a loop in a single Python process, i.e. TensorFlow is initialized once and the same computation graph is then used multiple times. In the second case the corruption occurs at the start of CUDA execution (possibly when memory is allocated or work is launched).
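
For illustration, a minimal sketch of the two reproduction modes (iteration counts and matrix sizes are just examples, not our exact scripts):

```python
# Illustrative repro sketch, not our exact scripts.
import subprocess
import tensorflow as tf

# Mode 1: TensorFlow is initialized from scratch in a new Python process,
# in a loop. This variant triggers the corruption for us.
def repro_reinit(iterations=100):
    for _ in range(iterations):
        subprocess.run(["python3", "-c", "import tensorflow"], check=True)

# Mode 2: one process, TensorFlow initialized once, the same tf.function
# graph executed repeatedly. Corruption appears less frequently here and
# seems to coincide with the start of GPU work.
@tf.function
def step(x):
    return tf.linalg.matmul(x, x)

def repro_graph(iterations=1000):
    x = tf.random.uniform((1024, 1024))
    for _ in range(iterations):
        step(x)
```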

We have reduced the probability that it is a HW issue by connecting the USB device via different PCIe connections/cards. We have also checked the raw data using usbmon and verified that it is already corrupt before it leaves the kernel.
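
For reference, this is roughly how we capture the raw traffic via usbmon's text interface (the bus number is just an example; it requires the usbmon module, debugfs, and root access):

```python
# Capture completed bulk-IN URBs from usbmon's text interface.
# The bus number below is an example; use the bus your device is on (see lsusb).
USBMON_NODE = "/sys/kernel/debug/usb/usbmon/2u"

with open(USBMON_NODE, "r") as mon, open("usbmon_capture.txt", "w") as out:
    for line in mon:
        # "C" = completion event, "Bi" = bulk IN, i.e. data coming from the
        # device; these lines contain the payload as hex words.
        if " C Bi:" in line:
            out.write(line)
```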

The corruption is always a single byte (i.e. not a larger block), and it is not a single bit flip.
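
This is roughly how we classify the corruption when we have a reference copy of the payload (function and variable names are placeholders):

```python
def classify_corruption(expected: bytes, received: bytes):
    """Report differing byte positions and how many bits differ in each,
    to distinguish single-byte corruption from a single bit flip."""
    assert len(expected) == len(received)
    diffs = []
    for i, (a, b) in enumerate(zip(expected, received)):
        if a != b:
            diffs.append((i, a, b, bin(a ^ b).count("1")))
    return diffs

# In our traces we always see exactly one differing byte, and that byte
# differs in more than one bit, so it is not a single bit flip.
```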

Software versions:

  • Host OS: Ubuntu 20.04
  • CUDA processes run in a Docker container: nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04
  • Host CUDA driver: 510.47.03-1

Can anybody help us here?

You might wish to try the latest published GPU driver; currently that appears to be 515.76. I’m not saying that the GPU driver you have has a problem, or that I know what the issue is, or anything like that. It’s just a typical test we would do at NVIDIA if we ran into something like this. Since you’re using CUDA 11.2, an additional optional test would be to try the earliest driver that supports both that GPU and the CUDA 11.2 version you are using.

Beyond that, I’m not optimistic unless you/we can find a way to generate a reproducible test case that NVIDIA is likely to be able to create in our own lab.

@philip.haeusser

Are the CUDA operations working on the streaming data from the USB device? In the case you mention, where you can cause the error by running a loop of `import tensorflow`, I failed to see the relation between the USB I/F and a program that is spawning TensorFlow initialization under new processes.

Is this the CRC that is part of the USB spec or an additional CRC you’ve added to the payload?

Does the system behave similarly if you use the non-CUDA version of Tensorflow?

@alclark

Thank you, very good questions.

tl;dr: CPU-only TF: no problem. GPU TF: yes problem.

Are the CUDA operations working on the streaming data from the USB device?

In the full system, yes. In our minimal example, no.

I failed to see the relation between the USB I/F and a program that is spawning tensorflow initialization under new processes

Exactly. The two processes are not related to each other.

Is this the CRC that is part of the USB spec or an additional CRC you’ve added to the payload?

It is an additional CRC added to the payload. USB has its own CRC, and we have not seen any CRC errors from the USB layer itself (which I assume would be reported by the kernel driver).
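
For context, the payload check is conceptually like the following; zlib's CRC-32 here just stands in for the actual CRC and framing our device uses, which differ:

```python
import zlib

def payload_ok(frame: bytes) -> bool:
    # Illustrative framing: payload followed by a 4-byte CRC-32 (little endian).
    # The real device uses its own framing and CRC polynomial; this only
    # shows where the check sits relative to the USB-level CRC.
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored
```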

Does the system behave similarly if you use the non-CUDA version of Tensorflow?

With the non-CUDA version, there is no corruption.

@Robert_Crovella

We have indeed also tried the latest driver. Thank you for the hint. The problem persists, unfortunately.

@philip.haeusser

Thank you for the response. When you spawn the TensorFlow initialization under new processes, how are you checking the CRC simultaneously? Are you using a separate thread or application?

Is the corruption always the same byte location?

As @Robert_Crovella mentioned, we are unlikely to provide any meaningful input without the ability to reproduce the error on our end.

I suggest you focus on finding a way to help us replicate it, or try to determine the relation between the CUDA functionality and the data from the USB I/F. To me it seems like there is a read/write race condition in the data handling.
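
To illustrate the kind of race I mean, here is a toy sketch (names and sizes are made up); the reader takes a snapshot of a buffer the writer may still be updating, so a checksum over the snapshot can fail even though every individual write is correct:

```python
import threading

buf = bytearray(64)

def writer():
    for i in range(10_000):
        for j in range(len(buf)):
            buf[j] = (i + j) & 0xFF  # not atomic with respect to the reader

def reader():
    for _ in range(10_000):
        snapshot = bytes(buf)  # may capture a half-updated buffer
        _ = sum(snapshot)      # stand-in for a CRC over the snapshot

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()
```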