Evaluating Foundation Model Robot Pose Estimation with Synthetic Data Generation

1. Introduction

Position and orientation, or "pose," is typically represented as a 4x4 homogeneous transformation matrix that encodes an object's translation ("position") and rotation ("orientation"). One reason to care about robot pose estimation is that accurately predicting the two pose matrices, one for the robot and one for an object, lets us compute a "relative grasp" transform describing how the robot should position itself to grasp the object successfully.
Block diagram: composing the relative grasp transform from the camera-frame poses of the robot and the object.

$$T_R^O = T_C^O \times T_R^C$$

$$\left(T_C^R\right)^{-1} = T_R^C$$

If you can perform accurate robot pose estimation using a foundation model, you should be able to grasp items the model wasn't trained on, with robots the model wasn't trained on, by leveraging the open-vocabulary capabilities that come from pretraining on massive datasets. The team behind FoundationPose had already shown this works on household objects such as a mustard bottle and a power drill. My goal was to show that FoundationPose has this "open-vocabulary" capability on robot data it had never seen. I will briefly cover FoundationPose's architecture and training details, but for more, please refer to the original work: FoundationPose
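As a small illustration of that composition (this is my own sketch, not code from the original project; it assumes the two camera-frame poses are 4x4 NumPy arrays and mirrors the equations above):

    import numpy as np

    def relative_grasp_transform(T_C_O: np.ndarray, T_C_R: np.ndarray) -> np.ndarray:
        """Compose the relative grasp transform T_R^O from two camera-frame poses.

        T_C_O: 4x4 pose relating the camera and the object (T_C^O above).
        T_C_R: 4x4 pose relating the camera and the robot  (T_C^R above).
        """
        T_R_C = np.linalg.inv(T_C_R)   # (T_C^R)^-1 = T_R^C
        return T_C_O @ T_R_C           # T_R^O = T_C^O x T_R^C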

2. Model Architecture

FoundationPose Architecture
2.1 Language-aided Synthetic Data Generation at Scale

2.3 Pose Hypothesis Generation - Make pose guesses and refine them.

2.3.1 Pose Initialization

2.3.2 Pose Refinement

$$\mathcal{L}_{\text{refine}} = w_1 \left\| \Delta \mathbf{t} - \Delta \bar{\mathbf{t}} \right\|_2 + w_2 \left\| \Delta \mathbf{R} - \Delta \bar{\mathbf{R}} \right\|_2$$

2.4 Pose Selection - Score the refined poses.

$$\mathcal{L}(i^+, i^-) = \max \left( \mathbf{S}(i^-) - \mathbf{S}(i^+) + \alpha,\, 0 \right)$$
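For intuition only, here is a tiny NumPy sketch of those two objectives as I read them; the weights, margin, and array shapes are my own choices, not the authors' implementation. The refinement loss penalizes the gap between predicted and ground-truth translation/rotation updates, and the selection loss is a hinge that pushes a good hypothesis's score above a bad one's by a margin alpha.

    import numpy as np

    def refine_loss(dt_pred, dt_gt, dR_pred, dR_gt, w1=1.0, w2=1.0):
        """L_refine = w1 * ||dt - dt_gt||_2 + w2 * ||dR - dR_gt||_2 (weights assumed)."""
        return w1 * np.linalg.norm(dt_pred - dt_gt) + w2 * np.linalg.norm(dR_pred - dR_gt)

    def selection_loss(score_pos, score_neg, alpha=0.1):
        """Hinge loss: penalize a negative hypothesis scoring within alpha of the positive one."""
        return max(score_neg - score_pos + alpha, 0.0)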

3. Synthetic Robot Data Generation

Now that we have an idea of how the model was trained, let's get back to what I actually did. In order to predict pose correctly, FoundationPose needs several data inputs:

  - An RGB image of the scene
  - A depth image
  - A segmentation mask of the robot
  - A mesh (CAD) model of the robot

In addition, for evaluation purposes we also need:

  - Ground truth annotations of the robot's pose in the camera frame

I generated this data by setting up the robot and a virtual camera inside Pybullet. I defined the virtual camera through a view matrix and a projection matrix: the view matrix defines the world-to-camera coordinate transform, while the projection matrix defines the transform from 3D camera coordinates to 2D image coordinates. I then took photos of the robot while rotating the camera around a sphere. Next, I calculated the ground truth robot pose annotations by getting the robot's world-frame pose and using the view matrix to transform it into the camera frame. I also used the projection matrix to project into image coordinates so I could visualize the ground truth pose annotations on the image data. The key steps (with a fuller sketch after the list) were:

  1. Rendering a Robot inside of Pybullet:
    robot_id = p.loadURDF("robot.urdf")
  2. Setting up a Virtual Camera and taking Images of the Robot:
    view_matrix = p.computeViewMatrix(camera_position, target_point, up_direction)
    projection_matrix = p.computeProjectionMatrixFOV(fov=60, aspect=w/h, nearVal=near_clip, farVal=far_clip)
    width, height, rgb, depth, seg = p.getCameraImage(w, h, view_matrix, projection_matrix)
  3. Getting Ground Truth Pose Annotations:
    link_pos, link_orn = p.getLinkState(robot_id, link_index)[:2]        # world-frame position + quaternion
    robot_world_pose = pose_to_matrix(link_pos, link_orn)                # assemble a 4x4 matrix (helper; see sketch below)
    V = np.array(view_matrix).reshape(4, 4, order="F")                   # PyBullet flattens its matrices column-major
    P = np.array(projection_matrix).reshape(4, 4, order="F")
    robot_camera_pose = V @ robot_world_pose                             # world frame -> camera frame
    pose_image_coordinates = P @ robot_camera_pose                       # camera frame -> clip space (divide by w, map to pixels)
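Putting those steps together, here is a minimal, self-contained sketch of the generation loop I'm describing. The Franka Panda URDF path comes from `pybullet_data`, and the camera placement, clipping planes, and link index are placeholder values of my choosing rather than the exact settings used for the experiments:

    import numpy as np
    import pybullet as p
    import pybullet_data

    # 1. Render a robot inside PyBullet (headless).
    p.connect(p.DIRECT)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    robot_id = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

    # 2. Set up a virtual camera and take an image.
    w, h = 640, 480
    camera_position, target_point, up_direction = [1.0, 0.0, 0.5], [0.0, 0.0, 0.3], [0.0, 0.0, 1.0]
    view_matrix = p.computeViewMatrix(camera_position, target_point, up_direction)
    projection_matrix = p.computeProjectionMatrixFOV(fov=60, aspect=w / h, nearVal=0.01, farVal=10.0)
    _, _, rgb, depth, seg = p.getCameraImage(w, h, view_matrix, projection_matrix)

    # PyBullet returns the 4x4 matrices flattened in column-major order.
    V = np.array(view_matrix).reshape(4, 4, order="F")        # world -> camera (OpenGL convention)
    P = np.array(projection_matrix).reshape(4, 4, order="F")  # camera -> clip space

    # 3. Get ground truth pose annotations for one link.
    link_index = 0
    link_pos, link_orn = p.getLinkState(robot_id, link_index)[:2]

    T_world = np.eye(4)                                       # 4x4 world-frame pose of the link
    T_world[:3, :3] = np.array(p.getMatrixFromQuaternion(link_orn)).reshape(3, 3)
    T_world[:3, 3] = link_pos

    T_camera = V @ T_world                                    # ground-truth pose in the camera frame

    # Project the link origin into pixel coordinates for visualization.
    clip = P @ V @ np.append(link_pos, 1.0)                   # homogeneous clip-space point
    ndc = clip[:3] / clip[3]                                  # normalized device coordinates in [-1, 1]
    u = (ndc[0] * 0.5 + 0.5) * w                              # pixel column
    v = (1.0 - (ndc[1] * 0.5 + 0.5)) * h                      # pixel row (flip y for image coordinates)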

Pybullet Synthetic Data Generation

Generated data samples: RGB, Mask, Depth, GT Pose
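To turn the camera buffers into per-frame RGB, depth, and mask images, something like the following works. The directory layout, the 16-bit millimetre depth encoding, and the use of default segmentation flags are my assumptions for illustration, not a format prescribed by FoundationPose:

    import numpy as np
    from PIL import Image

    near, far = 0.01, 10.0   # must match nearVal/farVal used for the projection matrix

    def save_camera_outputs(rgb, depth, seg, robot_id, w, h, index):
        """Convert PyBullet camera buffers into RGB / depth / mask image files."""
        rgb_img = np.reshape(rgb, (h, w, 4))[:, :, :3].astype(np.uint8)       # drop the alpha channel

        # PyBullet returns a non-linear OpenGL depth buffer in [0, 1]; convert to metric depth.
        depth_buffer = np.reshape(depth, (h, w))
        depth_m = far * near / (far - (far - near) * depth_buffer)

        # With default flags the segmentation buffer stores object unique ids per pixel.
        mask = (np.reshape(seg, (h, w)) == robot_id).astype(np.uint8) * 255

        Image.fromarray(rgb_img).save(f"rgb/{index:06d}.png")
        Image.fromarray((depth_m * 1000).astype(np.uint16)).save(f"depth/{index:06d}.png")  # depth in mm
        Image.fromarray(mask).save(f"masks/{index:06d}.png")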

4. Evaluation

Now, with my synthetic data in hand, all that was left was to run FoundationPose on it and get robot pose estimates. Once I had these predictions, I evaluated them against the pose annotations I generated earlier. However, there was one last transform I needed to apply to make the poses line up. After much searching, I found that Pybullet follows the OpenGL camera convention, which points the Y and Z axes in the opposite directions from how OpenCV defines them. FoundationPose follows the OpenCV convention, while my synthetic data follows the OpenGL convention. I therefore ran my annotations through one final transform that flips the Y and Z axes, so the pose predictions and annotations live in the same frame. Finally, I evaluated the translation component of the pose with Euclidean distance and the rotation component with angular (geodesic) error, as sketched below. The results were good: translation error was under 1 mm and rotation error under 1 degree.
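A minimal sketch of that axis flip and of the two error metrics, assuming 4x4 NumPy pose matrices with translations in metres (the function names are mine):

    import numpy as np

    # OpenGL camera: +Y up, looking down -Z.  OpenCV camera: +Y down, looking down +Z.
    # Flipping the Y and Z axes of the camera frame converts between the two conventions.
    GL_TO_CV = np.diag([1.0, -1.0, -1.0, 1.0])

    def opengl_to_opencv(T_gl: np.ndarray) -> np.ndarray:
        """Re-express a camera-frame pose from the OpenGL convention in the OpenCV convention."""
        return GL_TO_CV @ T_gl

    def translation_error_mm(T_pred: np.ndarray, T_gt: np.ndarray) -> float:
        """Euclidean distance between predicted and ground-truth translations, in millimetres."""
        return float(np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3]) * 1000.0)

    def rotation_error_deg(T_pred: np.ndarray, T_gt: np.ndarray) -> float:
        """Geodesic (angular) distance between predicted and ground-truth rotations, in degrees."""
        R_rel = T_pred[:3, :3].T @ T_gt[:3, :3]
        cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        return float(np.degrees(np.arccos(cos_angle)))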
Franka Panda Demo
Rotation Error: 0.674°
Translation Error: 0.655 mm