[BUG] cgf helloworld sample string1m_t POD datatype communication hangs with sigtimedwait() timeout

Required Info:

  • Software Version
    DRIVE OS 6.0.8.1
  • Target OS
    Linux
  • SDK Manager Version
    1.9.2.10884
  • Host Machine Version
    native Ubuntu Linux 20.04 Host installed with DRIVE OS DOCKER Containers

Describe the bug

Using the helloworld and sum sample sources from the driveworks-5.14 release (nv_driveworks/driveworks-5.14/samples/src/cgf_nodes/HelloWorldNode.hpp and nv_driveworks/driveworks-5.14/samples/src/cgf_nodes/SumNodeImpl.cpp, both on the main branch of the ZhenshengLee/nv_driveworks GitHub repository)
and the official CGF helloworld demo guide https://github.com/ZhenshengLee/nv_driveworks/blob/main/drive-agx-orin-doc/3-drive-works/CGF-presentation.pdf , we ran a minimal test that sends large data over a socket connection and found unexpected behavior.

To Reproduce

git clone -b bug/socket https://github.com/ZhenshengLee/nv_driveworks.git
# follow the README_en.md to compile and run helloworld cgf app
sudo ./bin/cgf_custom_nodes/example/runHelloworld.sh

The main change in the app is switching the datatype from int to string1m_t. See nv_driveworks/driveworks-5.14/samples/src/cgf_nodes/src/channel/std_msgs/String.hpp at commit bdc53323613d75806cfdabef569d3f4a9140516f in the ZhenshengLee/nv_driveworks GitHub repository:

typedef dw::core::FixedString<32> string32_t;
typedef dw::core::FixedString<64> string64_t;
typedef dw::core::FixedString<128> string128_t;
typedef dw::core::FixedString<256> string256_t;
typedef dw::core::FixedString<512> string512_t;
typedef dw::core::FixedString<1024> string1k_t;
typedef dw::core::FixedString<1048576> string1m_t;
typedef dw::core::FixedString<2097152> string2m_t;

DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string32_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string64_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string128_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string256_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string512_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string1k_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string1m_t);
DWFRAMEWORK_DECLARE_PACKET_TYPE_POD(string2m_t);

Expected behavior

The CGF app runs successfully and returns 0, and the log shows no errors.

Actual behavior

launcher.log reports a sem_timedwait failure, followed by a sigtimedwait() timeout in the launcher.

[STM][ERROR] sem_timedwait failed; errno: 4 (Interrupted system call)
[STM ERROR]:[av/stm/runtime/src/core/comm.c][stmReceiveStateData] [63]: recv failed to receive state data
[STM ERROR]:[av/stm/runtime/src/core/state.c][recvGlobalScheduleState] [701]: Could not receive updated schedule state from Master
av/stm/runtime/src/core/state.c:702 assertion failure, errno=4 (Interrupted system call)
[2024-08-24T17:25:49.538147Z][FATAL][tid:0][Launcher.cpp:1039][Launcher] Process sum_process0:761780 terminated by signal: 11 (Segmentation fault)
[2024-08-24T17:25:49.540877Z][FATAL][tid:0][Launcher.cpp:1039][Launcher] Process schedule_manager:761778 terminated by signal: 6 (Aborted)
[2024-08-24T17:25:49.579840Z][FATAL][tid:0][Launcher.cpp:1039][Launcher] Process ssm:761775 terminated by signal: 15 (Terminated)
[2024-08-24T17:26:19.579994Z][ERROR][tid:0][Launcher.cpp:1327][Launcher] sigtimedwait() timeout.
[2024-08-24T17:26:19.580128Z][ERROR][tid:0][Launcher.cpp:1384][Launcher] Killing all live child processes with SIGKILL...
[2024-08-24T17:26:19.921106Z][FATAL][tid:0][Launcher.cpp:1039][Launcher] Process stm_master:761779 terminated by signal: 9 (Killed)
[2024-08-24T17:26:19.988572Z][FATAL][tid:0][Launcher.cpp:1039][Launcher] Process helloworld_process0:761776 terminated by signal: 9 (Killed)
[2024-08-24T17:26:19.988600Z][INFO][tid:0][Launcher.cpp:1095][Launcher] waitForChildExit: No more child process!
[2024-08-24T17:26:19.988610Z][ERROR][tid:0][Launcher.cpp:1345][Launcher] All child processes has been killed successfully.
[2024-08-24T17:26:19.988625Z][FATAL][tid:0][Launcher.cpp:1565][Launcher] launcher exit status: 33
[2024-08-24T17:26:19.988694Z][DEBUG][tid:0][Launcher.cpp:1589][Launcher] swc_list.txt content:
line 1 : helloworld_process0,127.0.0.1
line 2 : multiple_process0,127.0.0.1
line 3 : sum_process0,127.0.0.1
line 4 : 

Additional context

If you check out the previous commit (a0ebe76048f34b307e312244bc6429ae96dae8ac), which uses string32_t rather than string1m_t, the CGF app works well.

commit bdc53323613d75806cfdabef569d3f4a9140516f (HEAD -> bug/socket, github/bug/socket)
Author: lizhensheng <lzs_1993@qq.com>
Date:   Wed Aug 28 15:42:05 2024 +0800

    add helloworld string1m_t sample bug.

commit a0ebe76048f34b307e312244bc6429ae96dae8ac
Author: lizhensheng <lzs_1993@qq.com>
Date:   Wed Aug 28 14:46:14 2024 +0800

    add helloworld socket string32_t sample.

commit 954c31e3c176ff0d56c20111b385b859debebfc1 (github/main, main)

Dear @lizhensheng,
I will repro the issue and update you.
BTW any reason to use DRIVE OS 6.0.8.1 and not latest release?

Thanks for that!

The reasons why we have not upgraded from 6.0.8.1 to 6.0.10.0 are:

  1. 6.0.8.1 was released a year ago, so we have been on it for a long time.
  2. The parking and pilot apps have been heavily developed based on 6.0.8.1.
  3. From the 6.0.10.0 release notes we don't see any critical bug fixes for CGF or NvSci.

If you have any information that would encourage us to upgrade to 6.0.10.0, feel free to share it with me.

@SivaRamaKrishnaNV friendly ping for updates.