Crash after cudaStreamAttachMemAsync

Hello,

I am working on a streaming software on the TX1/TX2 that does image processing from multiple cameras.

I'm struggling with segmentation faults when I try to access memory after cudaStreamAttachMemAsync.

This is how the program works:

The software has multiple processing stages; each stage runs inside its own thread, and each thread has its own CUDA stream associated with it.
The program implements a memory pool which holds the buffers for V4L2, allocated with cudaMallocManaged.
These managed memory buffers are presented to the camera in the "capture" module and then forwarded to the processing stages:

capture → processing1 → processing2 → gstreamer:appsrc

In each of the processing steps, the buffer is attached to the processing stage's CUDA stream with cudaStreamAttachMemAsync followed by cudaStreamSynchronize. To make sure processing has finished before pushing the buffer to the next stage, cudaStreamSynchronize is called again at the end of each stage.
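To make the pattern concrete, here is a minimal sketch of what one such stage might look like. All names (`run_stage`, `stage_kernel`) are illustrative, not from the actual code base:

```cpp
// Sketch of the per-stage pattern described above (hypothetical names).
#include <cuda_runtime.h>

// Placeholder for the real processing kernel of this stage.
__global__ void stage_kernel(unsigned char *buf, size_t len);

void run_stage(cudaStream_t stage_stream, unsigned char *managed_buf, size_t len)
{
    // Associate the managed buffer with this stage's stream so the driver
    // migrates it for work queued on that stream.
    cudaStreamAttachMemAsync(stage_stream, managed_buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stage_stream);  // wait for the attachment to take effect

    stage_kernel<<<256, 256, 0, stage_stream>>>(managed_buf, len);

    // Make sure processing has finished before the buffer is handed to the
    // next stage, which will attach it to its own stream.
    cudaStreamSynchronize(stage_stream);
}
```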

Now, in the capture module, when I try to queue the buffer, V4L2 complains with "Errno 14: Bad Address"; if I try to read from this buffer, the program crashes with a segmentation fault and the following kernel messages:

Mar 15 10:32:05 swc238 kernel: [19912.166075] swstream[6253]: unhandled level 3 translation fault (11) at 0x11a420000, esr 0x92000007
Mar 15 10:32:05 swc238 kernel: [19912.175197] pgd = ffffffc07b700000
Mar 15 10:32:05 swc238 kernel: [19912.178595] [11a420000] *pgd=00000001df511003, *pud=00000001df511003, *pmd=00000001d16ed003, *pte=04e00001c0c4c712
Mar 15 10:32:05 swc238 kernel: [19912.189019]
Mar 15 10:32:05 swc238 kernel: [19912.190515] CPU: 3 PID: 6253 Comm: swstream Tainted: G           O    4.4.38 #9
Mar 15 10:32:05 swc238 kernel: [19912.197851] Hardware name: quill (DT)
Mar 15 10:32:05 swc238 kernel: [19912.202675] task: ffffffc1e2c6f080 ti: ffffffc1516f0000 task.ti: ffffffc1516f0000
Mar 15 10:32:05 swc238 kernel: [19912.210182] PC is at 0x7f998dea90
Mar 15 10:32:05 swc238 kernel: [19912.213515] LR is at 0x7f98fa9f40
Mar 15 10:32:05 swc238 kernel: [19912.216843] pc : [<0000007f998dea90>] lr : [<0000007f98fa9f40>] pstate: 80000000
Mar 15 10:32:05 swc238 kernel: [19912.224251] sp : 0000007f3f67b5d0
Mar 15 10:32:05 swc238 kernel: [19912.227588] x29: 0000007f3f67b5d0 x28: 0000007f98392b80
Mar 15 10:32:05 swc238 kernel: [19912.232948] x27: 0000007f98392ae0 x26: 0000007f9900cc80
Mar 15 10:32:05 swc238 kernel: [19912.238300] x25: 0000007f999801b8 x24: 0000007f9900cc78
Mar 15 10:32:05 swc238 kernel: [19912.243655] x23: 000000011a420000 x22: 0000007f9900c000
Mar 15 10:32:05 swc238 kernel: [19912.249005] x21: 0000007f3f67b680 x20: 0000007f99033000
Mar 15 10:32:05 swc238 kernel: [19912.254354] x19: 0000007f3f67b6f0 x18: 0000000000000014
Mar 15 10:32:05 swc238 kernel: [19912.259718] x17: 0000007f998dea80 x16: 0000007f990341c8
Mar 15 10:32:05 swc238 kernel: [19912.265070] x15: 0000000000000020 x14: 2ce33e6c02ce33e7
Mar 15 10:32:05 swc238 kernel: [19912.270414] x13: 000000000000016d x12: 0000007f98398540
Mar 15 10:32:05 swc238 kernel: [19912.275779] x11: 0000000000000003 x10: 0101010101010101
Mar 15 10:32:05 swc238 kernel: [19912.281126] x9 : 0000007f20000020 x8 : 0101010101010101
Mar 15 10:32:05 swc238 kernel: [19912.286483] x7 : 7f7f7f7f7f7f7f7f x6 : 0000007f2000a828
Mar 15 10:32:05 swc238 kernel: [19912.291923] x5 : 0000000000000000 x4 : 0000000000000000
Mar 15 10:32:05 swc238 kernel: [19912.297286] x3 : 0000000000000020 x2 : 4aa81d63563f6800
Mar 15 10:32:05 swc238 kernel: [19912.302630] x1 : 0000000000000000 x0 : 000000011a420000
Mar 15 10:32:05 swc238 kernel: [19912.307978]
Mar 15 10:32:05 swc238 kernel: [19912.309469] Library at 0x7f998dea90: 0x7f99868000 /lib/aarch64-linux-gnu/libc-2.23.so
Mar 15 10:32:05 swc238 kernel: [19912.317303] Library at 0x7f98fa9f40: 0x7f98f49000 /home/user/soccerwatchrecorder/build/src/libsw_proc.so
Mar 15 10:32:05 swc238 kernel: [19912.326781] vdso base = 0x7f99b58000
Mar 15 10:32:05 swc238 kernel: [19912.813307] nvcsi 150c0000.nvcsi: csi4_cil_check_status (5) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.821242] nvcsi 150c0000.nvcsi: csi4_cil_check_status (5) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.901320] nvcsi 150c0000.nvcsi: csi4_cil_check_status (0) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.909252] nvcsi 150c0000.nvcsi: csi4_cil_check_status (0) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.989479] nvcsi 150c0000.nvcsi: csi4_stream_check_status (1) INTR_STATUS 0x00000004
Mar 15 10:32:05 swc238 kernel: [19912.997326] nvcsi 150c0000.nvcsi: csi4_stream_check_status (1) ERR_INTR_STATUS 0x00000004
Mar 15 10:32:05 swc238 kernel: [19913.005510] nvcsi 150c0000.nvcsi: csi4_cil_check_status (1) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19913.013428] nvcsi 150c0000.nvcsi: csi4_cil_check_status (1) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.093408] nvcsi 150c0000.nvcsi: csi4_stream_check_status (2) INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.101254] nvcsi 150c0000.nvcsi: csi4_stream_check_status (2) ERR_INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.109435] nvcsi 150c0000.nvcsi: csi4_cil_check_status (2) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.117352] nvcsi 150c0000.nvcsi: csi4_cil_check_status (2) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.197355] nvcsi 150c0000.nvcsi: csi4_stream_check_status (3) INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.205199] nvcsi 150c0000.nvcsi: csi4_stream_check_status (3) ERR_INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.213392] nvcsi 150c0000.nvcsi: csi4_cil_check_status (3) CIL_INTR_STATUS 0x00000064
Mar 15 10:32:06 swc238 kernel: [19913.221318] nvcsi 150c0000.nvcsi: csi4_cil_check_status (3) CIL_ERR_INTR_STATUS 0x00000064
Mar 15 10:32:06 swc238 kernel: [19913.301397] nvcsi 150c0000.nvcsi: csi4_cil_check_status (4) CIL_INTR_STATUS 0x00000044
Mar 15 10:32:06 swc238 kernel: [19913.309324] nvcsi 150c0000.nvcsi: csi4_cil_check_status (4) CIL_ERR_INTR_STATUS 0x00000044

The stream can run fine for multiple hours before the crash appears, and the faulting address had already been used a few hundred times before.

Also, the same software runs fine on the TX1 (CUDA 8); the crash only appears on the TX2 (CUDA 9).

[i]
The questions:

  1. What is the cause of the translation fault? Is the buffer not ready to be accessed by the CPU/V4L2?

  2. I built my application for safety over speed and call cudaStreamSynchronize a lot to make sure the managed memory is accessible from the CPU at the beginning and end of each processing stage. Do you see a misconception here that could lead to this kind of bug?

  3. How is V4L2 detecting that the pointer is a "Bad Address" without causing a segmentation fault?

  4. Is it safe to build a workaround for when this kind of translation fault happens?

[/i]

Hi,

You need to register the camera frame with the EGL interface if you want to access it with the GPU.

It's recommended to check our MMAPI samples first.
There are lots of related examples, e.g. ${tegra_multimedia_api}/samples/v4l2cuda

Thanks.

Hi Aasta,

thank you for your answer, but I cannot see how it is related to my problem. I am not using MMAPI; I'm just using plain V4L2 combined with CUDA managed memory.

GPU access works fine; only CPU access is (sometimes) causing translation faults.

I am sure that the software is conceptually sound, as it works on the TX1 but not on the TX2.

The bug seems to be more about managed memory than EGL/MMAPI.

Hi,
Please share test code that we can build and run. We will check whether it works on the TX1 (r28.2) and fails on the TX2 (r28.2.1).

Hi Dane,

Thank you for the offer. The code base I am working with is actually quite large, and it will take some time to create a minimal test case that reproduces this error. But I will definitely do that if I cannot figure out the error myself by the end of the week.

I would still be thankful for answers to the questions in my initial post.

Hi,
Not sure, but maybe cuCtxSynchronize() is not called somewhere, or maybe some APIs, such as cudaMallocManaged() or cudaMalloc(), are not working properly. We still need to reproduce it first and then investigate further.

Hi,
We have a sample which should be similar to your case:

${tegra_multimedia_api}/samples/v4l2cuda

It would be great if you could make a patch on it that reproduces the issue.

Hi, I just came back to say that the problem has been solved. I cannot be 100% sure, but I think the problem was that I did not call cudaStreamSynchronize after CPU access, so the automatic data migration of the managed memory was not finished by the time I pushed the data to the next stream.
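For anyone hitting the same thing, a hedged sketch of the handoff with the extra synchronization added. The names (`hand_off`, the stream variables) are illustrative, not from my actual code:

```cpp
// Hypothetical sketch of the fix: after the CPU touches the managed buffer,
// synchronize the stream it is currently attached to before the buffer is
// re-attached to the next stage's stream.
#include <cuda_runtime.h>

void hand_off(cudaStream_t current_stream, cudaStream_t next_stream,
              unsigned char *managed_buf)
{
    // ... CPU accesses managed_buf here (e.g. V4L2 fills or reads it) ...

    // This call was missing in my code: ensure any pending work/migration
    // on the current stream is complete before ownership changes.
    cudaStreamSynchronize(current_stream);

    // The next stage takes ownership of the buffer on its own stream.
    cudaStreamAttachMemAsync(next_stream, managed_buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(next_stream);
}
```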

I actually do not have the time to create a minimal example showing what was going on, but I hope this hint can guide someone else who faces the same issue.

Hi,

Thanks for sharing this with us.
Good to know the issue is fixed now.

Thanks.