Verify OpenVLA in robosuite

I want to verify OpenVLA in robosuite. The following is my system info:
robosuite 1.15.1
Ubuntu 22.04
Jetson AGX Orin (64 GB)
conda environment (Python 3.10)

I wrote a simple piece of code:

import robosuite as suite
from robosuite.controllers import load_part_controller_config
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

local_model_path = "/home/yljy/jetson-containers/data/models/huggingface/models--openvla--openvla-7b/snapshots/31f090d05236101ebfc381b61c674dd4746d4ce0"

processor = AutoProcessor.from_pretrained(local_model_path, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    local_model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

controller_config = load_part_controller_config(default_controller="IK_POSE")

robosuite_env = suite.make(
    "Lift",
    robots="Panda",
    has_renderer=True,
    has_offscreen_renderer=True,
    use_camera_obs=True,
    camera_names="frontview",
    camera_heights=640,
    camera_widths=480
)

obs = robosuite_env.reset()

prompt = "In: What action should the robot take to pick up the cube?\nOut:"

while True:
    image = Image.fromarray(obs['frontview_image'])

    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    action[0:3] = action[0:3] * 100  # scale position deltas, because the controller sensitivity is 0.01
    action[6] = action[6] * 2 - 1  # remap the gripper: OpenVLA outputs [0, 1], robosuite expects [-1, 1]

    print(action)

    obs, reward, done, info = robosuite_env.step(action)
    robosuite_env.render()

I run this code to use OpenVLA to control the manipulator, but something goes wrong. By rights, the robotic arm should be moving down, but the action[2] that OpenVLA outputs (i.e., the Z-axis offset) is always positive. Is it because the end-effector coordinate frame in robosuite doesn't match the end-effector coordinate frame that OpenVLA expects?
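
A minimal diagnostic sketch for this (assuming the default robosuite observation keys robot0_eef_pos and frontview_image; the vertical flip is an assumption, since robosuite's offscreen renders often come out upside down, which on its own can throw the model off):

import numpy as np

obs = robosuite_env.reset()
for step in range(50):
    # Assumption: flip the offscreen render so the model sees an upright image
    frame = np.flipud(obs["frontview_image"]).copy()
    image = Image.fromarray(frame)

    inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    action[0:3] = action[0:3] * 100  # same scaling as in the code above
    action[6] = action[6] * 2 - 1    # same gripper remap as in the code above

    # Compare the commanded Z delta against the actual end-effector motion
    eef_z_before = obs["robot0_eef_pos"][2]
    obs, reward, done, info = robosuite_env.step(action)
    eef_z_after = obs["robot0_eef_pos"][2]
    print(f"step {step}: commanded dz={action[2]:+.4f}, actual dz={eef_z_after - eef_z_before:+.4f}")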

Hi,

Could you try to run the same image in a desktop environment?
This will help us figure out whether the issue is Jetson-specific.

Thanks.

I don't think it is, but I don't know the cause of the problem.

Hi,

This will help us verify whether the issue is hardware-dependent or purely a software issue.
Do you have such an environment to try it in?

Thanks.

Hi,
I am also working on the OpenVLA model. May I ask what camera calibration you use in your simulation, e.g., camera pose, camera matrix, focal length, etc.?

I’m currently working on a simulation project with the following setup:

  • Operating System: Ubuntu 22.04
  • ROS 2 Distribution: Humble
  • Motion Planning Framework: MoveIt 2
  • Simulation Environment: NVIDIA Isaac Sim 4.2.0
  • Programming Language: Python

I’m utilizing Isaac Sim’s built-in camera and require assistance with the following aspects of camera calibration:

  1. Intrinsic Parameters:
  • Determining the camera matrix
  • Identifying the focal length
  2. Extrinsic Parameters:
  • Establishing the camera's pose within the simulation environment
  3. Calibration Process:
  • Best practices for calibrating the simulated camera to ensure accurate data representation

I aim to adjust the camera settings appropriately to enhance the fidelity of my simulation.
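
For reference, a rough sketch of how the intrinsics of an ideal simulated pinhole camera can be built directly from the render resolution and field of view, without a calibration procedure (the resolution and FOV values below are placeholders, not the actual Isaac Sim camera settings):

import numpy as np

def pinhole_intrinsics(width, height, fov_deg, fov_is_horizontal=True):
    # Ideal pinhole camera: no distortion, square pixels, principal point at the image center
    fov = np.deg2rad(fov_deg)
    if fov_is_horizontal:
        fx = fy = (width / 2.0) / np.tan(fov / 2.0)
    else:
        fx = fy = (height / 2.0) / np.tan(fov / 2.0)
    cx, cy = width / 2.0, height / 2.0
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

# Placeholder example: a 640x480 render with a 60-degree horizontal FOV
print(pinhole_intrinsics(640, 480, 60.0))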

Hi,

Is this question for @15962063926?
If not, could you file a new topic for your question, and we will find the corresponding team to check it.

Thanks.

Does OpenVLA work in your simulation environment? In fact, I don't adjust the camera settings, because robosuite only provides the frontview; I only set camera_heights and camera_widths.
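
If it helps with the calibration question above, an untested sketch (assuming the camera_utils helpers shipped with recent robosuite versions) of how the frontview parameters could be read straight from the simulator instead of calibrated:

from robosuite.utils import camera_utils

# Intrinsic matrix for the configured render size (640x480 here, matching the code above)
K = camera_utils.get_camera_intrinsic_matrix(
    sim=robosuite_env.sim,
    camera_name="frontview",
    camera_height=640,
    camera_width=480,
)
# 4x4 camera pose in the world frame
T_world_cam = camera_utils.get_camera_extrinsic_matrix(
    sim=robosuite_env.sim,
    camera_name="frontview",
)
print("intrinsics:\n", K)
print("extrinsics (camera pose in world frame):\n", T_world_cam)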

Hi,

We need to check with our internal team.
Will get back to you soon.

Thanks.

Hi @15962063926, here is a version with MimicGen integration - OpenVLA - NVIDIA Jetson AI Lab

Yes, you need to trace it through and make sure all the coordinate-space transforms are in the reference frame the model expects, and to fine-tune the model. There are a lot of hyperparameters to adjust and experiment with.
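
As a rough illustration of that kind of tracing (the rotation matrix below is a placeholder, not the actual bridge-to-robosuite transform, and rotating an RPY delta this way is only a small-angle approximation):

import numpy as np

# Placeholder: replace with the frame alignment you find by tracing the pipeline
R_policy_to_env = np.eye(3)

def remap_action(action):
    action = np.asarray(action, dtype=np.float64).copy()
    action[0:3] = R_policy_to_env @ action[0:3]  # translational delta
    action[3:6] = R_policy_to_env @ action[3:6]  # rotational delta (small-angle approximation)
    return action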

I would also recommend not focusing too heavily on OpenVLA-7B, as several other VLAs have since come out. OpenVLA is regarded as harder to train and is in fact larger/slower at 7B, whereas smaller mini-VLAs have been appearing (for example, EVLA Efficient Vision-Language-Action Models | by Paweł Budzianowski | K-Scale Labs, or OpenPi).


OpenVLA does not work in my environment. I think we need the camera calibration that matches the OpenVLA model.