Hello,
I am working an a streaming software on TX1/TX2 to do image processing from multiple cameras.
Im struggling with segmentation faults when when I try to access memory after cudaStreamAttachMemAsync.
This is how the program works:
The software as multiple processing stages where each stage works inside a thread and each thread has its own cuda stream associated with it.
The program impelemts a memory pool which holds the buffers for V4L2, allocated with cudaAllocManaged.
Now these managed memory buffers are presented to the camera in the “capture” module and forwarded it to the processing stages:
capture → processing1 → processing2 → gstreamer:appsrc
In each of the processing steps, the buffer is attached to the process stages’ cuda stream with cudaStreamAttachMemAsync and cudaStreamSynchronize. To make sure processing has been finished before pushing it to the next stage, after each stage cudaStreamSynchronize is called.
Now it the capture module when I try to Queue the buffer, V4l2 complains about “Errno:14 Bad Address”; if I try to read from this buffer the program crashes with segmentation fault and following Kernel message:
Mar 15 10:32:05 swc238 kernel: [19912.166075] swstream[6253]: unhandled level 3 translation fault (11) at 0x11a420000, esr 0x92000007
Mar 15 10:32:05 swc238 kernel: [19912.175197] pgd = ffffffc07b700000
Mar 15 10:32:05 swc238 kernel: [19912.178595] [11a420000] *pgd=00000001df511003, *pud=00000001df511003, *pmd=00000001d16ed003, *pte=04e00001c0c4c712
Mar 15 10:32:05 swc238 kernel: [19912.189019]
Mar 15 10:32:05 swc238 kernel: [19912.190515] CPU: 3 PID: 6253 Comm: swstream Tainted: G O 4.4.38 #9
Mar 15 10:32:05 swc238 kernel: [19912.197851] Hardware name: quill (DT)
Mar 15 10:32:05 swc238 kernel: [19912.202675] task: ffffffc1e2c6f080 ti: ffffffc1516f0000 task.ti: ffffffc1516f0000
Mar 15 10:32:05 swc238 kernel: [19912.210182] PC is at 0x7f998dea90
Mar 15 10:32:05 swc238 kernel: [19912.213515] LR is at 0x7f98fa9f40
Mar 15 10:32:05 swc238 kernel: [19912.216843] pc : [<0000007f998dea90>] lr : [<0000007f98fa9f40>] pstate: 80000000
Mar 15 10:32:05 swc238 kernel: [19912.224251] sp : 0000007f3f67b5d0
Mar 15 10:32:05 swc238 kernel: [19912.227588] x29: 0000007f3f67b5d0 x28: 0000007f98392b80
Mar 15 10:32:05 swc238 kernel: [19912.232948] x27: 0000007f98392ae0 x26: 0000007f9900cc80
Mar 15 10:32:05 swc238 kernel: [19912.238300] x25: 0000007f999801b8 x24: 0000007f9900cc78
Mar 15 10:32:05 swc238 kernel: [19912.243655] x23: 000000011a420000 x22: 0000007f9900c000
Mar 15 10:32:05 swc238 kernel: [19912.249005] x21: 0000007f3f67b680 x20: 0000007f99033000
Mar 15 10:32:05 swc238 kernel: [19912.254354] x19: 0000007f3f67b6f0 x18: 0000000000000014
Mar 15 10:32:05 swc238 kernel: [19912.259718] x17: 0000007f998dea80 x16: 0000007f990341c8
Mar 15 10:32:05 swc238 kernel: [19912.265070] x15: 0000000000000020 x14: 2ce33e6c02ce33e7
Mar 15 10:32:05 swc238 kernel: [19912.270414] x13: 000000000000016d x12: 0000007f98398540
Mar 15 10:32:05 swc238 kernel: [19912.275779] x11: 0000000000000003 x10: 0101010101010101
Mar 15 10:32:05 swc238 kernel: [19912.281126] x9 : 0000007f20000020 x8 : 0101010101010101
Mar 15 10:32:05 swc238 kernel: [19912.286483] x7 : 7f7f7f7f7f7f7f7f x6 : 0000007f2000a828
Mar 15 10:32:05 swc238 kernel: [19912.291923] x5 : 0000000000000000 x4 : 0000000000000000
Mar 15 10:32:05 swc238 kernel: [19912.297286] x3 : 0000000000000020 x2 : 4aa81d63563f6800
Mar 15 10:32:05 swc238 kernel: [19912.302630] x1 : 0000000000000000 x0 : 000000011a420000
Mar 15 10:32:05 swc238 kernel: [19912.307978]
Mar 15 10:32:05 swc238 kernel: [19912.309469] Library at 0x7f998dea90: 0x7f99868000 /lib/aarch64-linux-gnu/libc-2.23.so
Mar 15 10:32:05 swc238 kernel: [19912.317303] Library at 0x7f98fa9f40: 0x7f98f49000 /home/user/soccerwatchrecorder/build/src/libsw_proc.so
Mar 15 10:32:05 swc238 kernel: [19912.326781] vdso base = 0x7f99b58000
Mar 15 10:32:05 swc238 kernel: [19912.813307] nvcsi 150c0000.nvcsi: csi4_cil_check_status (5) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.821242] nvcsi 150c0000.nvcsi: csi4_cil_check_status (5) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.901320] nvcsi 150c0000.nvcsi: csi4_cil_check_status (0) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.909252] nvcsi 150c0000.nvcsi: csi4_cil_check_status (0) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19912.989479] nvcsi 150c0000.nvcsi: csi4_stream_check_status (1) INTR_STATUS 0x00000004
Mar 15 10:32:05 swc238 kernel: [19912.997326] nvcsi 150c0000.nvcsi: csi4_stream_check_status (1) ERR_INTR_STATUS 0x00000004
Mar 15 10:32:05 swc238 kernel: [19913.005510] nvcsi 150c0000.nvcsi: csi4_cil_check_status (1) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:05 swc238 kernel: [19913.013428] nvcsi 150c0000.nvcsi: csi4_cil_check_status (1) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.093408] nvcsi 150c0000.nvcsi: csi4_stream_check_status (2) INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.101254] nvcsi 150c0000.nvcsi: csi4_stream_check_status (2) ERR_INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.109435] nvcsi 150c0000.nvcsi: csi4_cil_check_status (2) CIL_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.117352] nvcsi 150c0000.nvcsi: csi4_cil_check_status (2) CIL_ERR_INTR_STATUS 0x00000046
Mar 15 10:32:06 swc238 kernel: [19913.197355] nvcsi 150c0000.nvcsi: csi4_stream_check_status (3) INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.205199] nvcsi 150c0000.nvcsi: csi4_stream_check_status (3) ERR_INTR_STATUS 0x00000008
Mar 15 10:32:06 swc238 kernel: [19913.213392] nvcsi 150c0000.nvcsi: csi4_cil_check_status (3) CIL_INTR_STATUS 0x00000064
Mar 15 10:32:06 swc238 kernel: [19913.221318] nvcsi 150c0000.nvcsi: csi4_cil_check_status (3) CIL_ERR_INTR_STATUS 0x00000064
Mar 15 10:32:06 swc238 kernel: [19913.301397] nvcsi 150c0000.nvcsi: csi4_cil_check_status (4) CIL_INTR_STATUS 0x00000044
Mar 15 10:32:06 swc238 kernel: [19913.309324] nvcsi 150c0000.nvcsi: csi4_cil_check_status (4) CIL_ERR_INTR_STATUS 0x00000044
The stream can run fine multiple hours before the crash appears and also the address that is faulting was used few hundred times before.
Also the same software runs fine on TX1 (cuda 8), the crash only apears on TX2(cuda 9).
[i]
The questions:
-
What is the cause of the translation fault? Is the buffer not ready to be proccessed by the CPU/v4l2?
-
I build my application for safeness before speed and use cudaStreamSynchronize a lot to make sure the managed memory is accesible from CPU at the beginning and end of each processing stage. Do you see a missconception here, which could let to this kind of bug?
-
How is V4L2 checking that the pointer is a “Bad Address” without causing a segmentation fault
-
Is it safe to create a workaround when the kind of translation faults happens?
[/i]