Hi NVIDIA team,
In a previous thread (Recommended long-horizon execution pattern for GR00T N1.6 with ZMQ), we asked about deploying GR00T N1.6 for a multi-stage manipulation task. The guidance we received was:
- GR00T is conditioned on the language prompt provided in the observation; prompt wording should match the type of instructions used during training/fine-tuning.
- For explicit multi-stage tasks, using an external Task Manager / Behavior Tree for skill-level re-prompting is usually the more controllable deployment pattern, since the N1.6 policy/ZMQ server is stateless at the task level.
- The ZMQ server does not keep task-level memory across requests; each get_action() call should be treated as conditioned only on the current observation and the prompt included in that request.
- If a skill succeeds before the action chunk finishes executing, it’s acceptable to discard the remaining chunk and request a new one (receding-horizon style), provided the downstream controller doesn’t keep executing stale queued commands and the next request uses the latest observation.
Recommended pattern: send one subgoal prompt at a time, execute only the needed prefix of the returned chunk, and re-query at skill boundaries or when the scene state changes.
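To confirm our reading of that recommended pattern, here is a minimal sketch of the loop we have in mind. Everything in it is our own placeholder, not the GR00T API: `run_subgoals`, the four callbacks, the `"prompt"` observation key, and the `max_queries_per_subgoal` limit are all hypothetical, and `query_policy` stands in for whatever wraps the real get_action() request.

```python
from typing import Callable, List, Sequence

Action = List[float]


def run_subgoals(
    prompts: Sequence[str],
    get_observation: Callable[[], dict],
    query_policy: Callable[[dict], List[Action]],  # hypothetical wrapper around get_action()
    send_action: Callable[[Action], None],
    subgoal_done: Callable[[str, dict], bool],
    max_queries_per_subgoal: int = 50,  # placeholder safety limit, not a recommended value
) -> List[str]:
    """Run subgoal prompts one at a time; return the prompts that completed."""
    completed: List[str] = []
    for prompt in prompts:
        done = False
        for _ in range(max_queries_per_subgoal):
            # Every request is built from the latest observation only,
            # because the server keeps no task-level memory.
            obs = get_observation()
            if subgoal_done(prompt, obs):
                done = True
                break
            # "prompt" is an assumed key name, not the actual observation schema.
            chunk = query_policy({**obs, "prompt": prompt})
            for action in chunk:
                send_action(action)
                if subgoal_done(prompt, get_observation()):
                    # Execute only the needed prefix; the rest of the chunk
                    # is dropped here and never sent to the controller.
                    done = True
                    break
            if done:
                break
        if not done:
            break  # subgoal never completed; stop instead of moving on
        completed.append(prompt)
    return completed
```

The inner `break` is the point we care about most: the unexecuted tail of a chunk never reaches the controller, so no stale commands remain queued when the next prompt is sent.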
We’re now designing the external Task Manager in more detail and have a few follow-up questions, especially around language prompting as a per-request conditioning channel and how it interacts with bimanual/shared policies and action chunk execution.
1. Bimanual shared policy: can each arm receive a different prompt?
For a bimanual/shared 14-DOF policy, does a single get_action() request apply one unified language prompt to both arms jointly, or can each arm receive a different/independent prompt within the same observation?
For example:
- left arm: “hold position”
- right arm: “approach target”
This matters for deployment patterns where one arm needs to wait or hold while the other arm is still in a different subgoal phase.
Is asymmetric per-arm language conditioning supported for a joint bimanual policy, or does the policy require one unified instruction that describes both arms together?
2. Skill-level re-prompting if fine-tuning used only task-level language
If our fine-tuning dataset used a single static task-level language annotation, for example:
“pick up the cylinder”
but did not include sub-skill-level prompt variation, is it advisable to deploy skill-level re-prompting at inference time using new prompts such as:
- “move to grasp position”
- “close gripper and lift”
- “move to place position”
Does the GR00T base pretraining provide enough language generalization for this kind of unseen prompt granularity, or would unseen sub-skill prompts risk degraded action quality?
In other words, if we want to use prompt-based phase signaling reliably, should our own fine-tuning dataset include matching sub-skill-level language annotations during training?
3. Discarding unfinished action chunks and immediately requesting a new one
In a supervised deployment, the Task Manager's finite-state machine (FSM) may detect that a subgoal has completed before the current action chunk has finished executing. In that case, the downstream controller may discard the remaining actions in the chunk and immediately request a new chunk with the next subgoal prompt.
Is there any known instability risk from this kind of abrupt context switch between consecutive chunks, especially since training data likely consists of natural continuous episodes rather than externally interrupted mid-motion chunks?
Are there any recommended best practices, such as:
- minimum number of executed steps before discarding a chunk,
- minimum dwell time per subgoal,
- maximum re-planning frequency,
- or a recommended ratio relative to the action chunk horizon, such as 200 steps?
We are trying to avoid jitter or unstable re-planning when subgoal transitions happen frequently.
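For context, this is roughly the kind of guard we are considering on the Task Manager side. It is only a sketch: the `SwitchGuard` class and its methods are hypothetical, and the thresholds `min_steps=4` and `min_dwell_s=0.5` are arbitrary placeholders that we would replace with whatever values you recommend.

```python
from dataclasses import dataclass


@dataclass
class SwitchGuard:
    """Gate subgoal switches on a minimum step count and a minimum dwell time."""

    min_steps: int = 4          # placeholder: steps to execute before a chunk may be discarded
    min_dwell_s: float = 0.5    # placeholder: minimum time to stay on one subgoal
    steps_in_chunk: int = 0
    subgoal_started_at: float = 0.0

    def start_subgoal(self, now: float) -> None:
        # Call when a new subgoal prompt becomes active.
        self.subgoal_started_at = now
        self.steps_in_chunk = 0

    def start_chunk(self) -> None:
        # Call each time a fresh action chunk is received.
        self.steps_in_chunk = 0

    def record_step(self) -> None:
        # Call after each action from the current chunk is executed.
        self.steps_in_chunk += 1

    def may_switch(self, now: float) -> bool:
        # Allow a switch only when both thresholds are satisfied.
        enough_steps = self.steps_in_chunk >= self.min_steps
        enough_dwell = (now - self.subgoal_started_at) >= self.min_dwell_s
        return enough_steps and enough_dwell
```

The FSM would call `may_switch(now)` (for example with `time.monotonic()`) before discarding a chunk, so a subgoal-complete signal that arrives too early is deferred rather than acted on immediately.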
4. Conditioning channels other than language prompt
Beyond the language prompt field, does the N1.6 observation schema or Policy API / ZMQ interface expose any other structured conditioning channel for task phase?
For example:
- phase index,
- task-stage one-hot,
- custom metadata field,
- explicit subgoal ID,
- or other non-language conditioning input.
Or is the language prompt currently the only flexible conditioning channel exposed through the Policy API / ZMQ interface?
Why this matters
The answers to questions 1 and 2 directly affect our deployment and fine-tuning strategy. If per-arm prompting is supported, we may be able to simplify bimanual synchronization logic by using language conditioning for asymmetric arm phases. If skill-level prompting requires matching language granularity during fine-tuning, then we would need to redesign our dataset annotations before relying on inference-time sub-skill prompts.
Thank you again for your guidance.