[BUG] CGF helloworld sample: inconsistent socket communication behavior with mailbox+reuse connection

Required Info:

  • Software Version
    [x] DRIVE OS 6.0.8.1
  • Target OS
    [x] Linux
  • SDK Manager Version
    [x] 1.9.2.10884
  • Host Machine Version
    [x] native Ubuntu Linux 20.04 host installed with DRIVE OS Docker containers

Describe the bug

Using the helloworld and sum source code provided with the driveworks-5.14 release samples (nv_driveworks/driveworks-5.14/samples/src/cgf_nodes/HelloWorldNode.hpp at main · ZhenshengLee/nv_driveworks · GitHub, nv_driveworks/driveworks-5.14/samples/src/cgf_nodes/SumNodeImpl.cpp at main · ZhenshengLee/nv_driveworks · GitHub) and the official CGF helloworld demo guide (nv_driveworks/drive-agx-orin-doc/3-drive-works/CGF-presentation.pdf at main · ZhenshengLee/nv_driveworks · GitHub), we ran a minimal test of the mailbox+reuse connection and found unexpected behavior.

To Reproduce

  1. Build a CGF app with a helloworld process and a sum process.
  2. For helloworld, the hyperepoch is 10 ms × 10 frames; for sum, it is 20 ms × 5 frames, so the upstream node runs twice as fast as the downstream node.
  3. In SumNode, add checking logic verifying that the node never receives repeated data/messages across epochs.
  4. Compile and run the CGF app, and check data consistency.

For step 2, the code is as follows.
app.json

{
    "name": "DWCGFHelloworldApp",
    "logSpec": "file/rfc5424:{{logpath}}/{{appname}}.log",
    "parameters": {},
    "requiredSensors": "../../../extra/appCommon/DWCGFImagePipe.required-sensors.json",
    "sensorMappingLookups": [
        "../../../extra/sensor_mappings"
    ],
    "subcomponents": {
        "top": {
            "componentType": "./DWCGFHelloworld.graphlet.json",
            "parameters": {}
        }
    },
    "connections": [],
    "states": {
        "STANDARD": {
            "stmScheduleKey": "standardSchedule",
            "default": true
        }
    },
    "stmSchedules": {
        "standardSchedule": {
            "wcet": "./DWCGFHelloworld_wcet.yaml",
            "hyperepochs": {
                "helloworldHyperepoch": {
                    "period": 100000000,
                    "epochs": {
                        "helloworldEpoch": {
                            "period": 10000000,
                            "frames": 10,
                            "passes": [
                                [
                                    "top.helloWorldNode"
                                ]
                            ]
                        }
                    },
                    "resources": {
                        "machine0.CPU1": []
                    }
                },
                "sumHyperepoch": {
                    "period": 100000000,
                    "epochs": {
                        "sumEpoch": {
                            "period": 20000000,
                            "frames": 5,
                            "passes": [
                                [
                                    "top.sumNode"
                                ]
                            ]
                        }
                    },
                    "resources": {
                        "machine0.CPU2": []
                    }
                }
            }
        }
    },
    "processes": {
        "ssm": {
            "runOn": "machine0",
            "executable": "vanillassm",
            "logSpec": "file:{{logpath}}/{{appname}}.log"
        },
        "schedule_manager": {
            "runOn": "machine0",
            "executable": "ScheduleManager",
            "argv": {
                "--enableScheduleSwitching": "false",
                "--scheduleManagerHostIP": "127.0.0.1",
                "--scheduleManagerHostPort": "40100",
                "--scheduleManagerNumClients": "1"
            }
        },
        "stm_master": {
            "runOn": "machine0",
            "executable": "stm_master",
            "logSpec": "file:{{logpath}}/{{appname}}.log",
            "argv": {
                "--allow-unregistered-runnables": true,
                "--timeout-us": "60000000",
                "--soc": "TegraA",
                "--core": "0",
                "--enable-memlock": true,
                "-m": true,
                "--log": "./LogFolder/team_node/Helloworld/stm.log",
                "--master-forked-log-path": "./LogFolder/team_node/Helloworld",
                "--epochs": "300"
            }
        },
        "helloworld_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.helloWorldNode"
            ]
        },
        "sum_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.sumNode"
            ]
        }
    },
    "extraInfo": "../../../extra/appCommon/DWCGFImagePipeExtraInfo.json"
}

graphlet.json

{
    "name": "DWCGFHelloworld",
    "inputPorts": {},
    "outputPorts": {},
    "parameters": {
        "paraName": { "type": "std::string", "default": "helloworld_name" }
    },
    "subcomponents": {
        "helloWorldNode": {
            "componentType": "../../../nodes/example/helloworld/HelloWorldNode.node.json",
            "parameters": {
                "name": "$paraName"
            }
        },
        "sumNode": {
            "componentType": "../../../nodes/example/helloworld/SumNode.node.json"
        }
    },
    "connections": [
        {
            "src": "helloWorldNode.VALUE_0",
            "dests": {
                "sumNode.VALUE_0": {
                    "mailbox": true,
                    "reuse": true,
                }
            }
        },
        {
            "src": "helloWorldNode.VALUE_1",
            "dests": {
                "sumNode.VALUE_1": {
                    "mailbox": true,
                    "reuse": true
                }
            }
        }
    ]
}

For step 3, the code is as follows.

    if (inPort0.isBufferAvailable() && inPort1.isBufferAvailable())
    {
        auto inputValue0 = *inPort0.getBuffer();
        auto inputValue1 = *inPort1.getBuffer();
        DW_LOGD << "[Epoch " << m_epochCount << "]"
                << " Received " << inputValue0 << " from input VALUE_0"
                << ", received " << inputValue1 << " from input VALUE_1."
                << " Add together: " << (inputValue0 + inputValue1) << "!" << Logger::State::endl;
        if (m_lastValue0 == inputValue0)
        {
            DW_LOGD << "int32_t is identical with the last one" << Logger::State::endl;
        }
        m_lastValue0 = inputValue0;
        m_lastValue1 = inputValue1;
    }
    else
    {
        DW_LOGD << "[Epoch " << m_epochCount << "] inPort.buffer not available!" << Logger::State::endl;
    }

Expected behavior

Run for 300 epochs and check the log files: the log line "int32_t is identical with the last one" should never appear.

Actual behavior

The log does show "int32_t is identical with the last one", which means the message arrived too late to update the downstream SumNode's mailbox, so the old/repeated data was reused by the downstream SumNode.

<13>1 2024-04-19T14:19:44.901606+08:00 - sum_process0 103896 - - [1713507555415171us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 299] Received 598 from input VALUE_0, received 9402 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:44.921565+08:00 - sum_process0 103896 - - [1713507555435129us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 300] Received 599 from input VALUE_0, received 9401 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:44.941576+08:00 - sum_process0 103896 - - [1713507555455140us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 301] Received 601 from input VALUE_0, received 9399 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:44.961625+08:00 - sum_process0 103896 - - [1713507555475189us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 302] Received 604 from input VALUE_0, received 9396 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:44.981537+08:00 - sum_process0 103896 - - [1713507555495100us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 303] Received 604 from input VALUE_0, received 9396 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:44.981562+08:00 - sum_process0 103896 - - [1713507555495128us][DEBUG][tid:38][SumNodeImpl.cpp:144][SumNode] int32_t is identical with the last one
<13>1 2024-04-19T14:19:45.001617+08:00 - sum_process0 103896 - - [1713507555515181us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 304] Received 608 from input VALUE_0, received 9394 from input VALUE_1. Add together: 10002!
<13>1 2024-04-19T14:19:45.021607+08:00 - sum_process0 103896 - - [1713507555535170us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 305] Received 609 from input VALUE_0, received 9391 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:45.041602+08:00 - sum_process0 103896 - - [1713507555555166us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 306] Received 611 from input VALUE_0, received 9389 from input VALUE_1. Add together: 10000!
<13>1 2024-04-19T14:19:45.061610+08:00 - sum_process0 103896 - - [1713507555575173us][DEBUG][tid:38][SumNodeImpl.cpp:138][SumNode] [Epoch 307] Received 613 from input VALUE_0, received 9387 from input VALUE_1. Add together: 10000!

The log archive is uploaded here:
team_node.zip (189.0 KB)

Additional context

This inconsistency does not happen with shm communication, i.e. when we put helloWorldNode and sumNode into a single process.


@SivaRamaKrishnaNV Friendly ping for updates.

Dear @lizhensheng,
I will try to reproduce this today and update you.
What I understand from the bug description is that the following connection settings are causing the issue in the config:

"mailbox": true,
"reuse": true

Yes, but I'd like to give you more details:

  1. inter-process (socket) + mailbox&reuse: bug
  2. intra-process (shm) + mailbox&reuse: no bug

More details

In app.json, putting helloworld and sum in different processes makes the communication inter-process, which uses sockets by default.

app.json

        "helloworld_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.helloWorldNode"
            ]
        },
        "sum_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.sumNode"
            ]
        }

Check the log showing that sockets are used:

<13>1 2024-04-19T14:19:32.974891+08:00 - sum_process0 103896 - - [1713507543488516us][DEBUG][tid:rr2_main][Runtime.cpp:652][Runtime] POST INIT BEGIN, SWC NAME : 
<13>1 2024-04-19T14:19:32.975103+08:00 - sum_process0 103896 - - [1713507543488726us][DEBUG][tid:25][Runtime.cpp:1300][Runtime] Runtime: create input channel for top.sumNode.VALUE_0 with parameter: type=SOCKET,fifo-size=1,id=18000,ip=127.0.0.1,mode=mailbox,reuse=true,role=consumer
<13>1 2024-04-19T14:19:32.975149+08:00 - sum_process0 103896 - - [1713507543488773us][DEBUG][tid:27][Runtime.cpp:1300][Runtime] Runtime: create input channel for top.sumNode.VALUE_1 with parameter: type=SOCKET,fifo-size=1,id=18001,ip=127.0.0.1,mode=mailbox,reuse=true,role=consumer
<13>1 2024-04-19T14:19:32.975197+08:00 - sum_process0 103896 - - [1713507543488820us][DEBUG][tid:26][ThreadPool.hpp:107][ThreadPool] Thread exit from ThreadPool: bindInput

In app.json, putting helloworld and sum in the same process makes the communication intra-process, which uses shm by default.

app.json

        "helloworld_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.helloWorldNode",
                "top.sumNode"
            ]
        },

Dear @lizhensheng,
I took a look at the code snippets.
Your shared repo does not seem to have the files below, and I noticed a few more changes are needed. Please confirm whether the below are correct. If you can update the repo with the same example, that would make it easy to reproduce without many changes to the config files.

"requiredSensors": "../../../extra/appCommon/DWCGFImagePipe.required-sensors.json",
    "sensorMappingLookups": [
        "../../../extra/sensor_mappings"
    ],

DWCGFImagePipe.required-sensors.json → can be empty
sensorMappingLookups → roadrunner-2.0/config/sensor_mappings ?

"stmSchedules": {
        "standardSchedule": {
            "wcet": "./DWCGFHelloworld_wcet.yaml",
            "hyperepochs": {

DWCGFHelloworld_wcet.yaml → what wcet value did you use for each pass? I can simply use 16000/32000.

 m_lastValue0 = inputValue0;
        m_lastValue1 = inputValue1;

Add the m_lastValue0 and m_lastValue1 variables in SumNodeImpl.hpp.

Any other changes missing?

extra.zip (1.5 KB)

DWCGFHelloworld_wcet.zip (1.1 KB)

In SumNodeImpl.hpp, add the private members:

private:
    // Passes functions
    dwStatus process();

    // node internal from cgf
    SumNodeParams m_params{};

    // node internal from loaderlite-stm
    dwContextHandle_t m_ctx{DW_NULL_HANDLE};

    // workload: config

    // workload: memory
    size_t m_epochCount{0};
    int32_t m_lastValue0{0};
    int32_t m_lastValue1{10000};

No other changes; the added code is just there to check data consistency.

I updated SumNodeImpl.hpp/cpp. I hit a strange issue when using the app.json snippet shared above. I renamed it hellosum.app.json and updated a few of the paths.

root@6.0.8.1-0006-build-linux-sdk:/usr/local/driveworks-5.14/tools/descriptionScheduleYamlGenerator# /usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py --app /usr/local/driveworks/apps/helloWorldApp/hellosum.app.json --output ./hellosum__standardSchedule.yaml
Input Description file path: /usr/local/driveworks/apps/helloWorldApp/hellosum.app.json
Output schedule directory path: /usr/local/driveworks-5.14/tools/descriptionScheduleYamlGenerator/
=============================
Analysis...
Traceback (most recent call last):
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 3123, in <module>
    main()
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 3081, in main
    systemDescription = SystemDescription(appFile)
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 2999, in __init__
    self.schedules = [
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 3000, in <listcomp>
    ScheduleDescription(self.baseDir, stmSchedule, stmScheduleKey, identifier, stmScheduleKey == defaultStmScheduleKey, self)
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 1437, in __init__
    self.subcomponents[graphletId] = GraphletDescription(graphletId, grapheletFile, self)
  File "/usr/local/driveworks/tools/descriptionScheduleYamlGenerator/descriptionScheduleYamlGenerator.py", line 2780, in __init__
    graphletJson = json.load(gf)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 26 column 17 (char 693)

But line 26 column 17 is already enclosed in double quotes.

"helloworldHyperepoch": {
                    "period": 100000000

Updated hellosum.app.json

{
    "name": "DWCGFHelloworldApp",
    "logSpec": "file/rfc5424:{{logpath}}/{{appname}}.log",
    "parameters": {},
    "requiredSensors": "./extra/appCommon/DWCGFImagePipe.required-sensors.json",
    "sensorMappingLookups": [
        "./extra/sensor_mappings"
    ],
    "subcomponents": {
        "top": {
            "componentType": "./DWCGFHelloworld.graphlet.json",
            "parameters": {}
        }
    },
    "connections": [],
    "states": {
        "STANDARD": {
            "stmScheduleKey": "standardSchedule",
            "default": true
        }
    },
    "stmSchedules": {
        "standardSchedule": {
            "wcet": "./DWCGFHelloworld_wcet.yaml",
            "hyperepochs": {
                "helloworldHyperepoch": {
                    "period": 100000000,
                    "epochs": {
                        "helloworldEpoch": {
                            "period": 10000000,
                            "frames": 10,
                            "passes": [
                                [
                                    "top.helloWorldNode"
                                ]
                            ]
                        }
                    },
                    "resources": {
                        "machine0.CPU1": []
                    }
                },
                "sumHyperepoch": {
                    "period": 100000000,
                    "epochs": {
                        "sumEpoch": {
                            "period": 20000000,
                            "frames": 5,
                            "passes": [
                                [
                                    "top.sumNode"
                                ]
                            ]
                        }
                    },
                    "resources": {
                        "machine0.CPU2": []
                    }
                }
            }
        }
    },
    "processes": {
        "ssm": {
            "runOn": "machine0",
            "executable": "vanillassm",
            "logSpec": "file:{{logpath}}/{{appname}}.log"
        },
        "schedule_manager": {
            "runOn": "machine0",
            "executable": "ScheduleManager",
            "argv": {
                "--enableScheduleSwitching": "false",
                "--scheduleManagerHostIP": "127.0.0.1",
                "--scheduleManagerHostPort": "40100",
                "--scheduleManagerNumClients": "1"
            }
        },
        "stm_master": {
            "runOn": "machine0",
            "executable": "stm_master",
            "logSpec": "file:{{logpath}}/{{appname}}.log",
            "argv": {
                "--allow-unregistered-runnables": true,
                "--timeout-us": "60000000",
                "--soc": "TegraA",
                "--core": "0",
                "--enable-memlock": true,
                "-m": true,
                "--log": "./LogFolder/team_node/Helloworld/stm.log",
                "--master-forked-log-path": "./LogFolder/team_node/Helloworld",
                "--epochs": "300"
            }
        },
        "helloworld_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.helloWorldNode"
            ]
        },
        "sum_process0": {
            "runOn": "machine0",
            "executable": "LoaderLite",
            "subcomponents": [
                "top.sumNode"
            ]
        }
    },
    "extraInfo": "./extra/appCommon/DWCGFImagePipeExtraInfo.json"
}

Do you see any issue in the json file?

Be careful with the file paths if you want to reuse our graph files. @SivaRamaKrishnaNV

├── graphs
│   ├── apps
│   │   ├── example
│   │   │   ├── appHelloworld
│   │   │   │   ├── DWCGFHelloworld.app.json
│   │   │   │   ├── DWCGFHelloworld.graphlet.json
│   │   │   │   └── DWCGFHelloworld_wcet.yaml
│   ├── extra
│   │   ├── appCommon
│   │   │   ├── DWCGFImagePipeExtraInfo.json
│   │   │   ├── DWCGFImagePipeInstanceIDs.json
│   │   │   └── DWCGFImagePipe.required-sensors.json
│   │   └── sensor_mappings
│   │       └── pilot.sensor-mappings.json
│   ├── nodes
│   │   ├── example
│   │   │   └── helloworld
│   │   │       ├── HelloWorldNode.node.json
│   │   │       ├── MultipleNode.node.json
│   │   │       ├── SubtractNode.node.json
│   │   │       └── SumNode.node.json
└── src
    └── nodes
        ├── example
        │   ├── helloworld
        │   │   ├── HelloWorldNode.cpp
        │   │   ├── HelloWorldNode.hpp
        │   │   ├── HelloWorldNodeImpl.cpp
        │   │   └── HelloWorldNodeImpl.hpp
        │   └── sum
        │       ├── SumNode.cpp
        │       ├── SumNode.hpp
        │       ├── SumNodeImpl.cpp
        │       └── SumNodeImpl.hpp

Or you can just follow the official guide to create your own graphs and test the helloworld sample; the guide is the official PDF, a copy of which is at nv_driveworks/drive-agx-orin-doc/3-drive-works/CGF-presentation.pdf at main · ZhenshengLee/nv_driveworks · GitHub.

Feel free to ping me when you have reproduced the bug, @SivaRamaKrishnaNV. Thanks.

Force-setting TCP_NODELAY for the CGF apps resolves the issue. We used sschroe/libnodelay (github.com), a small wrapper library that adds the TCP_NODELAY option for all sockets, to hack the binary.

@SivaRamaKrishnaNV I hope you and the forum team can reproduce the issue and follow up.