[BUG] sample_cgf_dwchannel gets stuck in inter-process NvSci sync communication

Required Info:

  • Software Version
    DRIVE OS 6.0.6
  • Target OS
    Linux
  • SDK Manager Version
    1.9.2.10884
  • Host Machine Version
    native Ubuntu Linux 20.04 Host installed with DRIVE OS DOCKER Containers

Describe the bug

Running sample_cgf_dwchannel with the commands from the docs (Compute Graph Framework SDK Reference: CGF Channel Sample) fails; execution gets stuck.

To Reproduce

# t1
./sample_cgf_dwchannel --type=NVSCI --prod-stream-names=nvscisync_a_0 --prod-reaches=process --dataType=custom
# t2
./sample_cgf_dwchannel --type=NVSCI --cons-stream-names=nvscisync_a_1 --cons-reaches=process --dataType=custom
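
To make the hang easier to report, the two terminals above could be driven from one script. This is only a rough sketch: the flags are copied from the commands above, while the helper name `run_pair`, the timeout, the start delay, and the relative binary path are assumptions of this example. It also assumes the sample exits on its own after a successful run, which is not confirmed here.

```python
import subprocess
import time

# Flags copied verbatim from the t1/t2 commands above.
# The relative binary path is an assumption: run from the sample's directory.
PRODUCER = ["./sample_cgf_dwchannel", "--type=NVSCI",
            "--prod-stream-names=nvscisync_a_0",
            "--prod-reaches=process", "--dataType=custom"]
CONSUMER = ["./sample_cgf_dwchannel", "--type=NVSCI",
            "--cons-stream-names=nvscisync_a_1",
            "--cons-reaches=process", "--dataType=custom"]


def run_pair(producer_cmd, consumer_cmd, timeout_s=30.0, stagger_s=1.0):
    """Start both commands and wait until a shared deadline.

    Returns [producer_code, consumer_code]; None means the process was
    still running at the deadline and had to be killed.
    """
    procs = [subprocess.Popen(producer_cmd)]
    try:
        time.sleep(stagger_s)  # hypothetical delay between the two launches
        procs.append(subprocess.Popen(consumer_cmd))
    except Exception:
        # Do not leave the producer running if the consumer cannot start.
        procs[0].kill()
        procs[0].wait()
        raise

    deadline = time.monotonic() + timeout_s
    codes = []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            codes.append(proc.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            codes.append(None)
    return codes
```

Calling `run_pair(PRODUCER, CONSUMER)` on the target would return a `None` entry for each process that had not exited by the deadline.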

Expected behavior

Communication statistics such as latency should be shown.

Actual behavior

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel$ ./sample_cgf_dwchannel --type=NVSCI --prod-stream-names=nvscisync_a_0 --prod-reaches=process --dataType=custom
[11-07-2023 13:29:52] Platform: Detected Drive Orin P3710
[11-07-2023 13:29:52] TimeSource: monotonic epoch time offset is 1688958463806737
[11-07-2023 13:29:52] TimeSourceVibranteLinux: detect valid PTP interface mgbe2_0
[11-07-2023 13:29:52] TimeSource: Could not detect valid PTP time source at nvpps. Fallback to mgbe2_0
[11-07-2023 13:29:52] PTP Time is available from Eth Driver
[11-07-2023 13:29:52] Adding variable DW_Base:DW_Version
[11-07-2023 13:29:52] Added variable DW_Base:DW_Version
[11-07-2023 13:29:52] Platform: number of GPU devices detected 1
[11-07-2023 13:29:52] Platform: currently selected GPU device 0, Resource Data Dir: trt_08_05_10_03, Arch: ga10b
[11-07-2023 13:29:52] Platform: currently selected GPU device integrated ID 0
[11-07-2023 13:29:52] CUDLAEngine:getDLACount: CUDLA version is = 1003000
[11-07-2023 13:29:52] CUDLAEngine:getDLACount: Number of DLA devices = 2
[11-07-2023 13:29:52] Context::mountResourceCandidateDataPath resource FAILED to mount from './resources': VirtualFileSystem: Failed to mount './resources/resources.pak'
[11-07-2023 13:29:52] Context::mountResourceCandidateDataPath resource FAILED to mount from '/home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel/data': VirtualFileSystem: Failed to mount '/home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel/data/resources.pak'
[11-07-2023 13:29:52] Context::findDataRootInPathWalk data/DATA_ROOT found at: /usr/local/driveworks/data
[11-07-2023 13:29:52] Context::mountResourceCandidateDataPath resource FAILED to mount from '/usr/local/driveworks/data': VirtualFileSystem: Failed to mount '/usr/local/driveworks/data/resources.pak'
[11-07-2023 13:29:52] Context::findDataRootInPathWalk data/DATA_ROOT found at: /usr/local/driveworks-5.10/data
[11-07-2023 13:29:52] Context::mountResourceCandidateDataPath resource FAILED to mount from '/usr/local/driveworks-5.10/data': VirtualFileSystem: Failed to mount '/usr/local/driveworks-5.10/data/resources.pak'
[11-07-2023 13:29:52] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks/lib/libdw_base.so.5.10
[11-07-2023 13:29:52] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.10/targets/aarch64-Linux/lib/libdw_base.so.5.10
[11-07-2023 13:29:52] SDK: No resources(.pak) mounted, some modules will not function properly
[11-07-2023 13:29:52] egl::Display: found 1 EGL devices
[11-07-2023 13:29:52] egl::Display: use drm device: drm-nvdc
[11-07-2023 13:29:52] TimeSource: monotonic epoch time offset is 1688958463806737
[11-07-2023 13:29:52] TimeSourceVibranteLinux: detect valid PTP interface mgbe2_0
[11-07-2023 13:29:52] TimeSource: Could not detect valid PTP time source at nvpps. Fallback to mgbe2_0
[11-07-2023 13:29:52] PTP Time is available from Eth Driver
[11-07-2023 13:29:52] Initialize DriveWorks SDK v5.10.87
[11-07-2023 13:29:52] Release build with GNU 9.3.0 from buildbrain-branch-0-g9a5b4670e12 against Drive PDK v6.0.6.0
Creating channel with parameters: role=producer,type=NVSCI,ip=127.0.0.1,id=40002,num-clients=1,producer-fifo=1,fifo-size=4,timeout=1000,streamName=nvscisync_a_0,reach=process
[11-07-2023 13:29:52] thread loop started
[11-07-2023 13:29:52] event loop thread started
[11-07-2023 13:29:52]  Producer: Pool creation 
[11-07-2023 13:29:52]  Producer: Creating producer block 
[11-07-2023 13:29:52] Producer: Opening endpoint with name nvscisync_a_0 
!ERR![L:84]:nvsciipc_ipc_check_end: pid is not 0, but process doesn't exist, (pid:33276)
[11-07-2023 13:29:52]  Resetting Producer IPC Endpoints
[11-07-2023 13:29:52]  Source (UP) stream creation with ipcEndpoints, scisync, scibuf and ipcblocks
[11-07-2023 13:29:52] Connecting producer and multicast
[11-07-2023 13:29:52] Connecting multicast and IPC src
[11-07-2023 13:29:52] Add producer[0]: role=producer,type=NVSCI,ip=127.0.0.1,id=40002,num-clients=1,producer-fifo=1,fifo-size=4,timeout=1000,streamName=nvscisync_a_0,reach=process, current producer list size: 0
[11-07-2023 13:29:52] ChannelConnector: thread 281473060083968 starting producer and consumer connect threads 
[11-07-2023 13:29:52] registering pool block for connect notification
[11-07-2023 13:29:52] ChannelConnector: started producer and consumer threads. producer tid=281472878430464, consumer tid=281472886823168
[11-07-2023 13:29:52] registering producer block for connect notification
nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel$ ./sample_cgf_dwchannel --type=NVSCI --cons-stream-names=nvscisync_a_1 --cons-reaches=process --dataType=custom
[11-07-2023 13:30:16] Platform: Detected Drive Orin P3710
[11-07-2023 13:30:16] TimeSource: monotonic epoch time offset is 1688958463806737
[11-07-2023 13:30:16] TimeSourceVibranteLinux: detect valid PTP interface mgbe2_0
[11-07-2023 13:30:16] TimeSource: Could not detect valid PTP time source at nvpps. Fallback to mgbe2_0
[11-07-2023 13:30:16] PTP Time is available from Eth Driver
[11-07-2023 13:30:16] Adding variable DW_Base:DW_Version
[11-07-2023 13:30:16] Added variable DW_Base:DW_Version
[11-07-2023 13:30:16] Platform: number of GPU devices detected 1
[11-07-2023 13:30:16] Platform: currently selected GPU device 0, Resource Data Dir: trt_08_05_10_03, Arch: ga10b
[11-07-2023 13:30:16] Platform: currently selected GPU device integrated ID 0
[11-07-2023 13:30:16] CUDLAEngine:getDLACount: CUDLA version is = 1003000
[11-07-2023 13:30:16] CUDLAEngine:getDLACount: Number of DLA devices = 2
[11-07-2023 13:30:16] Context::mountResourceCandidateDataPath resource FAILED to mount from './resources': VirtualFileSystem: Failed to mount './resources/resources.pak'
[11-07-2023 13:30:16] Context::mountResourceCandidateDataPath resource FAILED to mount from '/home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel/data': VirtualFileSystem: Failed to mount '/home/nvidia/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel/data/resources.pak'
[11-07-2023 13:30:16] Context::findDataRootInPathWalk data/DATA_ROOT found at: /usr/local/driveworks/data
[11-07-2023 13:30:16] Context::mountResourceCandidateDataPath resource FAILED to mount from '/usr/local/driveworks/data': VirtualFileSystem: Failed to mount '/usr/local/driveworks/data/resources.pak'
[11-07-2023 13:30:16] Context::findDataRootInPathWalk data/DATA_ROOT found at: /usr/local/driveworks-5.10/data
[11-07-2023 13:30:16] Context::mountResourceCandidateDataPath resource FAILED to mount from '/usr/local/driveworks-5.10/data': VirtualFileSystem: Failed to mount '/usr/local/driveworks-5.10/data/resources.pak'
[11-07-2023 13:30:16] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks/lib/libdw_base.so.5.10
[11-07-2023 13:30:16] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.10/targets/aarch64-Linux/lib/libdw_base.so.5.10
[11-07-2023 13:30:16] SDK: No resources(.pak) mounted, some modules will not function properly
[11-07-2023 13:30:16] egl::Display: found 1 EGL devices
[11-07-2023 13:30:16] egl::Display: use drm device: drm-nvdc
[11-07-2023 13:30:16] TimeSource: monotonic epoch time offset is 1688958463806737
[11-07-2023 13:30:16] TimeSourceVibranteLinux: detect valid PTP interface mgbe2_0
[11-07-2023 13:30:16] TimeSource: Could not detect valid PTP time source at nvpps. Fallback to mgbe2_0
[11-07-2023 13:30:16] PTP Time is available from Eth Driver
[11-07-2023 13:30:16] Initialize DriveWorks SDK v5.10.87
[11-07-2023 13:30:16] Release build with GNU 9.3.0 from buildbrain-branch-0-g9a5b4670e12 against Drive PDK v6.0.6.0
Creating channel with parameters: role=consumer,type=NVSCI,ip=127.0.0.1,id=40002,timeout=1000,fifo-size=4,streamName=nvscisync_a_1,reach=process
[11-07-2023 13:30:16]  Consumer: Fifo Queue Create
[11-07-2023 13:30:16] thread loop started
[11-07-2023 13:30:16] event loop thread started
[11-07-2023 13:30:16]  Consumer: consumer block create
[11-07-2023 13:30:16] Consumer opening endpoint with name nvscisync_a_1 
!ERR![L:84]:nvsciipc_ipc_check_end: pid is not 0, but process doesn't exist, (pid:33226)
[11-07-2023 13:30:16]  Resetting consumer IPC Endpoint
[11-07-2023 13:30:16]  Consumer: IPC dest block create
[11-07-2023 13:30:16] Add consumer[0]: role=consumer,type=NVSCI,ip=127.0.0.1,id=40002,timeout=1000,fifo-size=4,streamName=nvscisync_a_1,reach=process, current consumer list size: 0
[11-07-2023 13:30:16] ChannelConnector: thread 281473734580480 starting producer and consumer connect threads 
[11-07-2023 13:30:16]  Consumer: Connect dest ipc block to consumer
[11-07-2023 13:30:16] ChannelConnector: started producer and consumer threads. producer tid=281473483507968, consumer tid=281473491900672
[11-07-2023 13:30:16] registering consumer block for connect notification

Additional context

The nvsciipc config looks correct:

nvidia@tegra-ubuntu:~/zhensheng/orin_ws/nv_driveworks_demo/target/aarch64/install/bin/common_cgf_channel$ cat /etc/nvsciipc.cfg | grep nvscisync_
INTER_PROCESS   nvscisync_a_0          nvscisync_a_1   16      24576
INTER_PROCESS   nvscisync_b_0          nvscisync_b_1   16      24576
INTER_PROCESS   nvscisync_c_0          nvscisync_c_1   16      24576
INTER_PROCESS   nvscisync_d_0          nvscisync_d_1   16      24576
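
The same check can be scripted instead of done by eye. The sketch below parses `INTER_PROCESS` lines like the ones shown and verifies that the producer and consumer stream names belong to the same entry. The function names are hypothetical, and the reading of the two name columns as the two ends of one channel is an assumption based on how the names are used in the commands; the two trailing numeric columns are ignored.

```python
from typing import Dict

# Lines copied from the grep output above.
SAMPLE_CFG = """\
INTER_PROCESS   nvscisync_a_0          nvscisync_a_1   16      24576
INTER_PROCESS   nvscisync_b_0          nvscisync_b_1   16      24576
INTER_PROCESS   nvscisync_c_0          nvscisync_c_1   16      24576
INTER_PROCESS   nvscisync_d_0          nvscisync_d_1   16      24576
"""


def parse_inter_process_pairs(cfg_text: str) -> Dict[str, str]:
    """Map every INTER_PROCESS endpoint name to the other name on its line."""
    peers: Dict[str, str] = {}
    for line in cfg_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and entries of any other type.
        if len(fields) < 3 or fields[0] != "INTER_PROCESS":
            continue
        first, second = fields[1], fields[2]
        peers[first] = second
        peers[second] = first
    return peers


def are_paired(peers: Dict[str, str], producer_ep: str, consumer_ep: str) -> bool:
    """True if both endpoint names come from the same config line."""
    return peers.get(producer_ep) == consumer_ep


peers = parse_inter_process_pairs(SAMPLE_CFG)
print(are_paired(peers, "nvscisync_a_0", "nvscisync_a_1"))
```

On the target, `SAMPLE_CFG` could be replaced by the contents of `/etc/nvsciipc.cfg`. A passing check only shows that the names are paired in the config; it says nothing about why the connection stalls.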

Dear @lizhensheng,
This is a known issue reported in the DRIVE OS 6.0.6 release notes (Bug #3948392). It is fixed in the next release.

@SivaRamaKrishnaNV thanks for your quick reply!

I’ve seen this in the 6.0.6 release notes:

3948392 CGF [New Issue] sample_cgf_dwchannel in inter-process nvscistream with asynchronous mode fails, execution stuck

That entry is different from this issue: this one occurs in sync mode, not asynchronous mode.

Could you double-check this sync-mode issue?

Thanks.
