[BUG] sample_cgf_dwchannel from dw5.14 gets stuck in inter-process socket communication

Required Info:

  • Software Version
    DRIVE OS 6.0.8.1
  • Target OS
    Linux
  • SDK Manager Version
    1.9.2.10884
  • Host Machine Version
    native Ubuntu Linux 20.04 Host installed with DRIVE OS DOCKER Containers

Describe the bug

Running sample_cgf_dwchannel with the commands from the docs (Compute Graph Framework SDK Reference: CGF Channel Sample) gets stuck at "Starting channel connection".

To Reproduce

# t1
./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0
# t2
./sample_cgf_dwchannel --prod=1 --downstreams=1 --cons=0
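
For repeated attempts, the two-terminal procedure above can also be driven from a single shell. The following is only a sketch, not part of the sample: `run_pair`, `SAMPLE`, and `LIMIT` are hypothetical names, and it assumes GNU coreutils `timeout` is available on the target. It starts the consumer and the producer in the background with the same flags as t1 and t2 and reports a hang if either one has to be killed after the limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run consumer and producer side by side with a hang guard.
# SAMPLE and LIMIT are illustrative names, not options of the sample itself.
SAMPLE="${SAMPLE:-./sample_cgf_dwchannel}"
LIMIT="${LIMIT:-30}"   # seconds before a run is treated as stuck

run_pair() {
    local cons_pid prod_pid cons_rc=0 prod_rc=0

    # Consumer side (t1 in the report); output goes to cons.log
    timeout "$LIMIT" "$SAMPLE" --cons=1 --prod=0 --downstreams=0 > cons.log 2>&1 &
    cons_pid=$!

    # Producer side (t2 in the report); output goes to prod.log
    timeout "$LIMIT" "$SAMPLE" --prod=1 --downstreams=1 --cons=0 > prod.log 2>&1 &
    prod_pid=$!

    wait "$cons_pid" || cons_rc=$?
    wait "$prod_pid" || prod_rc=$?

    # GNU timeout exits with status 124 when it had to kill the command.
    if [ "$cons_rc" -eq 124 ] || [ "$prod_rc" -eq 124 ]; then
        echo "HANG: consumer rc=$cons_rc producer rc=$prod_rc"
        return 1
    fi
    echo "OK: consumer rc=$cons_rc producer rc=$prod_rc"
}
```

Calling `run_pair` from the directory that contains the sample leaves the output of each side in `cons.log` and `prod.log`, so the last line printed before a hang can be inspected afterwards.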

Expected behavior

Communication statistics such as latency are shown.

Actual behavior

./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0
[31-05-2024 02:23:48] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks/lib/libdw_base.so.5.14
[31-05-2024 02:23:48] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.14/targets/aarch64-Linux/lib/libdw_base.so.5.14
Creating channel with parameters: role=consumer,type=SOCKET,ip=127.0.0.1,id=40002,uid=1,fifo-size=10,connect-timeout=100000
Starting channel connection
./sample_cgf_dwchannel --prod=1 --downstreams=1 --cons=0
[31-05-2024 02:24:17] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks/lib/libdw_base.so.5.14
[31-05-2024 02:24:17] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.14/targets/aarch64-Linux/lib/libdw_base.so.5.14
Creating channel with parameters: role=producer,type=SOCKET,id=40002,fifo-size=10,ip=127.0.0.1,num-clients=1,producer-fifo=1,connect-timeout=1000000
Starting channel connection
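
The two "Creating channel with parameters" lines above report different connect-timeout values (100000 for the consumer, 1000000 for the producer), and the two processes were started about 29 seconds apart (02:23:48 and 02:24:17). Whether this is related to the hang is not established here. To compare such values across runs, a small helper can pull one field out of a parameter line. This is a sketch: `channel_param` is a hypothetical name, and it assumes the comma-separated key=value layout shown in the logs.

```shell
# Hypothetical helper: print the value of one key from a
# "Creating channel with parameters: k=v,k=v,..." log line.
channel_param() {
    local key="$1" line="$2"
    # Drop the prefix up to the first ": ", split on commas,
    # then print the value of the entry whose key matches.
    printf '%s\n' "${line#*: }" \
        | tr ',' '\n' \
        | awk -F= -v k="$key" '$1 == k { print $2 }'
}

# Usage (line copied from the consumer log above):
#   channel_param connect-timeout "Creating channel with parameters: role=consumer,type=SOCKET,ip=127.0.0.1,id=40002,uid=1,fifo-size=10,connect-timeout=100000"
```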

Additional context

Only the inter-process socket communication fails; the other transports seem to run fine.

# socket
## inter-process-socket---BUG
./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0
./sample_cgf_dwchannel --prod=1 --downstreams=1 --cons=0
./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0 --dataType=custom
./sample_cgf_dwchannel --prod=1 --downstreams=1 --cons=0 --dataType=custom
## intra-process-socket
./sample_cgf_dwchannel --cons=1 --prod=1 --downstreams=1 --dataType=custom
./sample_cgf_dwchannel --cons=1 --prod=1 --downstreams=1 --dataType=int
./sample_cgf_dwchannel --cons=1 --prod=1 --downstreams=1 --dataType=dwImage
./sample_cgf_dwchannel --prod=1 --downstreams=2 --cons=2
## intra-inter-socket---BUG
./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0
./sample_cgf_dwchannel --prod=1 --downstreams=2 --cons=1

# nvsci
## inter-process-nvsci
./sample_cgf_dwchannel --type=NVSCI --prod-stream-names=nvscisync_a_0 --prod-reaches=process --dataType=custom
./sample_cgf_dwchannel --type=NVSCI --cons-stream-names=nvscisync_a_1 --cons-reaches=process --dataType=custom
## inter-process-nvsci-async
./sample_cgf_dwchannel --type=NVSCI --prod-stream-names=nvscisync_a_0 --prod-reaches=process --sync-mode=p2c
./sample_cgf_dwchannel --type=NVSCI --cons-stream-names=nvscisync_a_1 --cons-reaches=process --sync-mode=p2c
## intra-process-nvsci
./sample_cgf_dwchannel --type=NVSCI --num-local-consumers=2
## intra-inter-nvsci
./sample_cgf_dwchannel --type=NVSCI --prod-stream-names=nvscisync_a_0 --prod-reaches=process --num-local-consumers=1
./sample_cgf_dwchannel --type=NVSCI --cons-stream-names=nvscisync_a_1 --cons-reaches=process

# shm
## inter-process-shm (not supported)
# ./sample_cgf_dwchannel --type=SHMEM_LOCAL --cons=1 --prod=0 --downstreams=0
# ./sample_cgf_dwchannel --type=SHMEM_LOCAL --prod=1 --downstreams=1 --cons=0
## intra-process-shm ONLY
./sample_cgf_dwchannel --type=SHMEM_LOCAL --cons=1 --prod=1 --downstreams=1 --dataType=custom
./sample_cgf_dwchannel --type=SHMEM_LOCAL --cons=1 --prod=1 --downstreams=1 --dataType=int

I could reproduce the issue. I am checking whether it is a documentation or software issue and will update you.

Friendly ping @SivaRamaKrishnaNV for updates.

Dear @lizhensheng,
Could you try using the timeout parameters?

Terminal 1:
nvidia@tegra-ubuntu:/usr/local/driveworks/bin$ ./sample_cgf_dwchannel --prod=1 --downstreams=2 --prod-timeout=10000
[03-06-2024 09:17:36] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/bin/../lib/libdw_base.so.5.16
[03-06-2024 09:17:36] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/targets/aarch64-Linux/lib/libdw_base.so.5.16
Creating channel with parameters: role=producer,type=SOCKET,id=40002,fifo-size=10,ip=127.0.0.1,num-clients=2,producer-fifo=1,connect-timeout=10000000
Creating channel with parameters: role=consumer,type=SOCKET,ip=127.0.0.1,id=40002,uid=1,fifo-size=10,connect-timeout=100000
Starting channel connection
All channels connected!
Producer uid: 0 done
Consumer uid: 1 done
Producer Send BandWidth: 548.043MB/s
Consumer Idx[1] Latency: 15079us
Consumer Idx[1] Recv BandWidth: 547.437MB/s

Terminal 2:
nvidia@tegra-ubuntu:/usr/local/driveworks/bin$ ./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0 --cons-timeout=10000
[03-06-2024 09:17:34] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/bin/../lib/libdw_base.so.5.16
[03-06-2024 09:17:34] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/targets/aarch64-Linux/lib/libdw_base.so.5.16
Creating channel with parameters: role=consumer,type=SOCKET,ip=127.0.0.1,id=40002,uid=1,fifo-size=10,connect-timeout=10000000
Starting channel connection
All channels connected!
Consumer uid: 1 done
Consumer Idx[1] Latency: 10687us
Consumer Idx[1] Recv BandWidth: 550.024MB/s
nvidia@tegra-ubuntu:/usr/local/driveworks/bin$

Terminal 1: 
nvidia@tegra-ubuntu:/usr/local/driveworks/bin$ ./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0 --dataType=custom --cons-timeout=10000
[03-06-2024 12:42:34] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/bin/../lib/libdw_base.so.5.16
[03-06-2024 12:42:34] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/targets/aarch64-Linux/lib/libdw_base.so.5.16
Creating channel with parameters: role=consumer,type=SOCKET,ip=127.0.0.1,id=40002,uid=1,fifo-size=10,connect-timeout=10000000
Starting channel connection
All channels connected!
Consumer uid: 1 done
Consumer Idx[1] Latency: 71us
Consumer Idx[1] Recv BandWidth: 0MB/s

Terminal 2: 
nvidia@tegra-ubuntu:/usr/local/driveworks/bin$ ./sample_cgf_dwchannel --prod=1 --downstreams=1 --cons=0 --dataType=custom --prod-timeout=10000
[03-06-2024 12:42:31] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/bin/../lib/libdw_base.so.5.16
[03-06-2024 12:42:31] Context::findResourcesPackageInPathWalk: Could not find ./resources/resources.pak in upto 7 parent directories from /usr/local/driveworks-5.16/targets/aarch64-Linux/lib/libdw_base.so.5.16
Creating channel with parameters: role=producer,type=SOCKET,id=40002,fifo-size=10,ip=127.0.0.1,num-clients=1,producer-fifo=1,connect-timeout=10000000
Starting channel connection
All channels connected!
Producer uid: 0 done
Producer Send BandWidth: 0MB/s
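
To compare runs such as the two above (for example, the default run against the --dataType=custom run, which reports 0MB/s), the statistics lines can be filtered out of the sample output. This is a sketch only: `channel_stats` is a hypothetical name, and it assumes the exact line formats shown in these logs.

```shell
# Hypothetical helper: reduce sample output on stdin to its statistics lines,
# printed as "<name>=<number>" with the unit suffix removed.
channel_stats() {
    sed -n -E \
        -e 's/^Producer Send BandWidth: ([0-9.]+)MB\/s$/producer_send_MBps=\1/p' \
        -e 's/^Consumer Idx\[([0-9]+)\] Latency: ([0-9]+)us$/consumer_\1_latency_us=\2/p' \
        -e 's/^Consumer Idx\[([0-9]+)\] Recv BandWidth: ([0-9.]+)MB\/s$/consumer_\1_recv_MBps=\2/p'
}

# Usage: pipe the sample output (or a saved log) through the filter, e.g.
#   ./sample_cgf_dwchannel --cons=1 --prod=0 --downstreams=0 --cons-timeout=10000 | channel_stats
```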

Inter-process communication needs to use NVSCI or socket.

Understood, and the default channel type is socket.

I checked your log, and the binary you ran is from dw5.16, which is an internal version of DriveWorks.
Please reproduce this issue on DRIVE OS 6.0.8.1, which ships dw5.14.

At least the issue appears to be resolved in dw5.16.