Questions about language prompting, bimanual shared policies, and action chunk switching in GR00T N1.6 deployment

Hi NVIDIA team,

In a previous thread (Recommended long-horizon execution pattern for GR00T N1.6 with ZMQ), we asked about deploying GR00T N1.6 for a multi-stage manipulation task. The guidance we received was:

  • GR00T is conditioned on the language prompt provided in the observation; prompt wording should match the type of instructions used during training/fine-tuning.
  • For explicit multi-stage tasks, using an external Task Manager / Behavior Tree for skill-level re-prompting is usually the more controllable deployment pattern, since the N1.6 policy/ZMQ server is stateless at the task level.
  • The ZMQ server does not keep task-level memory across requests — each get_action() call should be treated as conditioned only on the current observation and the prompt included in that request.
  • If a skill succeeds before the action chunk finishes executing, it’s acceptable to discard the remaining chunk and request a new one (receding-horizon style), provided the downstream controller doesn’t keep executing stale queued commands and the next request uses the latest observation. Recommended pattern: send one subgoal prompt at a time, execute only the needed prefix of the returned chunk, and re-query at skill boundaries or when the scene state changes.

We’re now designing the external Task Manager in more detail and have a few follow-up questions, especially around language prompting as a per-request conditioning channel and how it interacts with bimanual/shared policies and action chunk execution.

1. Bimanual shared policy: can each arm receive a different prompt?

For a bimanual/shared 14-DOF policy, does a single get_action() request apply one unified language prompt to both arms jointly, or can each arm receive a different/independent prompt within the same observation?

For example:

  • left arm: “hold position”
  • right arm: “approach target”

This matters for deployment patterns where one arm needs to wait or hold while the other arm is still in a different subgoal phase.

Is asymmetric per-arm language conditioning supported for a joint bimanual policy, or does the policy require one unified instruction that describes both arms together?

2. Skill-level re-prompting if fine-tuning used only task-level language

If our fine-tuning dataset used a single static task-level language annotation, for example:

“pick up the cylinder”

but did not include sub-skill-level prompt variation, is it advisable to deploy skill-level re-prompting at inference time using new prompts such as:

  • “move to grasp position”
  • “close gripper and lift”
  • “move to place position”

Does the GR00T base pretraining provide enough language generalization for this kind of unseen prompt granularity, or would unseen sub-skill prompts risk degraded action quality?

In other words, if we want to use prompt-based phase signaling reliably, should our own fine-tuning dataset include matching sub-skill-level language annotations during training?

3. Discarding unfinished action chunks and immediately requesting a new one

In a supervised deployment, the external Task Manager (an FSM in our case) may detect that a subgoal has completed before the current action chunk has finished executing. In that case, the downstream controller may discard the remaining actions in the chunk and immediately request a new chunk with the next subgoal prompt.

Is there any known instability risk from this kind of abrupt context switch between consecutive chunks, especially since training data likely consists of natural continuous episodes rather than externally interrupted mid-motion chunks?

Are there any recommended best practices, such as:

  • minimum number of executed steps before discarding a chunk,
  • minimum dwell time per subgoal,
  • maximum re-planning frequency,
  • or a recommended ratio relative to the action chunk horizon, such as 200 steps?

We are trying to avoid jitter or unstable re-planning when subgoal transitions happen frequently.

4. Conditioning channels other than language prompt

Beyond the language prompt field, does the N1.6 observation schema or Policy API / ZMQ interface expose any other structured conditioning channel for task phase?

For example:

  • phase index,
  • task-stage one-hot,
  • custom metadata field,
  • explicit subgoal ID,
  • or other non-language conditioning input.

Or is the language prompt currently the only flexible conditioning channel exposed through the Policy API / ZMQ interface?

Why this matters

The answers to questions 1 and 2 directly affect our deployment and fine-tuning strategy. If per-arm prompting is supported, we may be able to simplify bimanual synchronization logic by using language conditioning for asymmetric arm phases. If skill-level prompting requires matching language granularity during fine-tuning, then we would need to redesign our dataset annotations before relying on inference-time sub-skill prompts.

Thank you again for your guidance.

Hello @chenjason7026,

Thanks for posting in the Isaac ROS forum!

For standard GR00T N1.6 Policy/ZMQ, one get_action() sample has one language input. The docs show language.task as shape (B, 1), and the N1.6 policy asserts only one language key / one language timestep. So the standard API does not support separate left-arm and right-arm prompts in the same joint bimanual request. Use one unified prompt describing both arms together.
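As a rough illustration of the "one unified prompt" point, here is a minimal sketch that merges two per-arm subgoals into a single instruction and wraps it as one language entry per batch item, i.e. shape (B, 1). The helper names and the flat "language.task" key placement in the observation dict are assumptions for illustration, not the confirmed request layout, and a composed prompt like this is only likely to help if similar wording appeared in your training annotations.

```python
from typing import Dict, List


def unified_prompt(left: str, right: str) -> str:
    """Merge two per-arm subgoals into ONE instruction string (hypothetical format)."""
    return f"left arm: {left}; right arm: {right}"


def language_field(prompts: List[str]) -> List[List[str]]:
    """One prompt per batch item and one language timestep, i.e. shape (B, 1)."""
    return [[p] for p in prompts]


def check_single_language_step(field: List[List[str]]) -> None:
    """Reject anything that is not exactly one prompt per batch item."""
    for row in field:
        if len(row) != 1:
            raise ValueError("expected exactly one language entry per sample")


field = language_field([unified_prompt("hold position", "approach target")])
check_single_language_step(field)

# Assumption: the key layout below is illustrative only.
observation: Dict[str, object] = {"language.task": field}
```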

For skill-level re-prompting, if fine-tuning only used task-level labels like “pick up the cylinder,” then using unseen sub-skill prompts such as “move to grasp position” or “close gripper and lift” is possible, but I would not treat it as reliable. To stay within what the docs support, keep the language prompt consistent with your training annotations and handle phase transitions in your external application logic.
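To make "handle phase transitions in your external application logic" concrete, here is a minimal sketch of a Task Manager that always sends the same training-time prompt and tracks the phase purely from external checks. Everything in it (the `Phase` and `TaskManager` classes, the world-state keys, the thresholds, and the two-check confirmation rule) is a hypothetical illustration and not part of GR00T.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

World = Dict[str, float]


@dataclass
class Phase:
    name: str
    is_done: Callable[[World], bool]  # external state/perception check


class TaskManager:
    """Tracks the phase externally; the prompt sent to the policy never changes."""

    def __init__(self, prompt: str, phases: List[Phase], confirm: int = 2):
        self.prompt = prompt
        self.phases = phases
        self.confirm = confirm  # consecutive positive checks needed to advance
        self.index = 0
        self._hits = 0

    @property
    def finished(self) -> bool:
        return self.index >= len(self.phases)

    def update(self, world: World) -> str:
        """Advance the phase if its check has held long enough; return the prompt."""
        if not self.finished:
            if self.phases[self.index].is_done(world):
                self._hits += 1
            else:
                self._hits = 0
            if self._hits >= self.confirm:
                self.index += 1
                self._hits = 0
        return self.prompt


manager = TaskManager(
    "pick up the cylinder",
    [
        Phase("reach", lambda w: w["dist"] < 0.02),
        Phase("grasp", lambda w: w["closed"] == 1.0),
        Phase("lift", lambda w: w["height"] > 0.10),
    ],
)
worlds = [
    {"dist": 0.30, "closed": 0.0, "height": 0.0},
    {"dist": 0.01, "closed": 0.0, "height": 0.0},
    {"dist": 0.01, "closed": 0.0, "height": 0.0},
    {"dist": 0.01, "closed": 1.0, "height": 0.0},
    {"dist": 0.01, "closed": 1.0, "height": 0.0},
]
trace = []
for world in worlds:
    prompt = manager.update(world)
    trace.append((manager.index, prompt))
```

Requiring the check to hold for a couple of consecutive updates is one simple way to avoid flipping phases on a single noisy reading; the right count depends on your control rate and sensors.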

GR00T N1.6 returns an action horizon/chunk shaped (B, T, D), and the docs show both extracting only the first action and executing multiple chunk steps. So executing only a prefix of the returned chunk is consistent with the documented API. If a subgoal completes early, discarding the remaining actions and requesting a new chunk is a reasonable receding-horizon integration pattern, but the docs do not specify a required minimum number of executed steps, a dwell time, or a replanning ratio.
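The prefix-execution pattern can be sketched roughly as follows. This is a toy loop under stated assumptions: `fake_get_action` stands in for the real policy call and returns a single-sample chunk of T actions, the 1-D integer "robot" exists only for the demo, and `min_steps` is an optional guard of our own, not a documented requirement.

```python
from typing import Callable, Dict, List, Sequence

Action = List[float]


def execute_prefix(chunk: Sequence[Action], send: Callable[[Action], None],
                   done: Callable[[], bool], min_steps: int = 1) -> int:
    """Send actions from one chunk until done() holds (after min_steps) or it ends."""
    executed = 0
    for action in chunk:
        send(action)
        executed += 1
        if executed >= min_steps and done():
            break  # the rest of the chunk is discarded, never queued
    return executed


def run_subgoal(get_action, get_obs, send, done, prompt: str,
                min_steps: int = 1, max_queries: int = 50) -> int:
    """Re-query with a fresh observation after every (possibly partial) chunk."""
    total = 0
    for _ in range(max_queries):
        chunk = get_action(get_obs(), prompt)
        total += execute_prefix(chunk, send, done, min_steps)
        if done():
            return total
    raise TimeoutError("subgoal not reached within max_queries")


# Toy stand-ins (all hypothetical): integer position, each action moves +1.
HORIZON = 8
position = [0]
queries = [0]


def fake_get_action(obs: Dict[str, int], prompt: str) -> List[Action]:
    queries[0] += 1
    return [[1.0] for _ in range(HORIZON)]


def get_obs() -> Dict[str, int]:
    return {"position": position[0]}


def send(action: Action) -> None:
    position[0] += int(action[0])


def done() -> bool:
    return position[0] >= 10


steps = run_subgoal(fake_get_action, get_obs, send, done, "pick up the cylinder")
```

In this toy run the first chunk is executed in full and the second is cut off once the goal check passes. On a real robot the important part is that the controller drops the unexecuted tail instead of continuing to play it.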

For other conditioning channels: in the standard N1.6 Policy/ZMQ API, the flexible learned conditioning channel is language plus the configured observation modalities. The client forwards observation and options, but Gr00tPolicy._get_action() marks options as currently unused. A custom phase index or one-hot stage would need to be added as a trained/registered modality, typically under state or another custom processor path; otherwise it will not condition the model.
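If you do go the custom-modality route, the data side could look roughly like the sketch below, which encodes the phase as a one-hot vector and attaches it as an extra state entry. The "phase" and "joint_positions" keys and both helpers are hypothetical; the modality would still have to be registered in your data config and present during fine-tuning, with the same encoding applied at inference.

```python
from typing import Dict, List


def phase_one_hot(index: int, num_phases: int) -> List[float]:
    """Encode a phase index as a one-hot vector of length num_phases."""
    if not 0 <= index < num_phases:
        raise ValueError("phase index out of range")
    return [1.0 if i == index else 0.0 for i in range(num_phases)]


def with_phase(state: Dict[str, List[float]], index: int,
               num_phases: int) -> Dict[str, List[float]]:
    """Return a copy of the state dict with a hypothetical 'phase' entry added."""
    augmented = dict(state)
    augmented["phase"] = phase_one_hot(index, num_phases)
    return augmented


# Assumption: a 14-DOF joint vector under an illustrative key name.
state = {"joint_positions": [0.0] * 14}
augmented_state = with_phase(state, index=1, num_phases=3)
```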

Thanks for the detailed clarification. I have one follow-up question related to the point about custom phase/subgoal conditioning.

If we add a binary endpoint label in the dataset, for example skill_done = 0/1 where 1 marks the completion timestep of a subgoal, should we assume this is only treated as a dataset annotation and will not be returned by get_action()?

In other words, does the standard GR00T N1.6 Policy/ZMQ action chunk only contain the configured action modalities, unless skill_done is explicitly defined and trained as an output dimension or custom output head?

If we do want the policy to predict a completion signal, what is the recommended representation?

  • append it as an extra action dimension,
  • define it as a separate action modality/key,
  • add a custom output head,
  • or avoid this and keep subgoal completion detection external in the FSM?

For near-term deployment, should we treat endpoint / done labels as external supervision for the Task Manager, and rely on external state/perception checks for subgoal completion rather than expecting GR00T to output a termination signal?