Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practically important because natural language is inevitably ambiguous. It requires robots to disambiguate the user’s input by actively gathering information. Previous approaches often rely on predefined templates to ask disambiguation questions, which degrades performance in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialog and grounding, our method can be trained jointly on extensive public data and generalizes well to diverse and challenging open-world scenarios. In the experiments, we validate TiO on the GuessWhat?! and InViG benchmarks, setting a new state of the art by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as on real-robot platforms. Results show that our method generalizes to diverse visual and language inputs with a high success rate.
TiO is a unified transformer for all the visual-language sub-tasks that together make up interactive visual grounding. To do so, we (1) unify training on datasets from image captioning, visual question answering (VQA), visual grounding (VG), and visual question generation (VQG); (2) unify prompts and predictions for multiple tasks; (3) unify the encoding and decoding of texts and bounding box coordinates in the tokenizer. Therefore, during inference, given the corresponding prompt as input, TiO can play the role of the Guessor, Oracle, or Questioner, and it outperforms baseline methods in each role. Combined with robot grasping models (e.g. Segment Anything + Contact GraspNet), TiO can be deployed on a real-robot platform to robustly perform interactive manipulation tasks from natural language inputs.
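As an illustration of point (3), the following is a minimal sketch of one common way to put bounding box coordinates into a text vocabulary: each coordinate is normalized by the image size and quantized into one of a fixed number of location tokens. The token format `<bin_i>`, the bin count of 1000, and the function names are assumptions made for this example; they are not taken from the TiO implementation.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed number of location tokens; not specified in the source


def box_to_tokens(box: Sequence[float], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize an (x1, y1, x2, y2) pixel box into four location tokens."""
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        frac = min(max(value / scale, 0.0), 1.0)       # normalize and clamp to [0, 1]
        idx = min(int(frac * num_bins), num_bins - 1)  # map to a bin index
        tokens.append(f"<bin_{idx}>")                  # hypothetical token format
    return tokens


def tokens_to_box(tokens: Sequence[str], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Decode four location tokens back to pixel coordinates (bin centers)."""
    scales = (width, height, width, height)
    coords = []
    for token, scale in zip(tokens, scales):
        idx = int(token[len("<bin_"):-1])              # strip "<bin_" and ">"
        coords.append((idx + 0.5) / num_bins * scale)
    return tuple(coords)
```

With such an encoding, a grounding prediction is simply a short token sequence, so the same decoder can emit either a question, an answer, or a box. The round trip is lossy by at most half a bin width per coordinate.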
We have deployed TiO on two real-robot platforms to evaluate its performance in interactive robotic manipulation. We use a Kinova arm for manipulation and a RealSense camera for RGB and point cloud observation. The mobile platform was developed by our own team. TiO achieves the highest interactive grounding success rate, 86%. This result demonstrates the robustness and accuracy of TiO in understanding and disambiguating the user’s ambiguous request.
To evaluate the disambiguation ability of TiO more comprehensively, we propose a challenging disambiguation evaluation set. It contains 150 images from the test set of the InViG dataset, OpenImages, and Objects365, 50 of which are sampled from images containing human-related categories. We divide the set into 3 parts that evaluate how well models understand diversified visual concepts (Scene Understanding), human attributes & behaviors (Human Understanding), and language expressions (Language Understanding), all of which are usually required in open-ended HRI applications. For each part, we select 50 images and re-label the instructions.
Examples of our evaluation benchmark for HRI experiments. Top row: Scene Understanding. Middle row: Human Understanding. Bottom row: Language Understanding.
Evaluation on InViG and GuessWhat?! benchmarks.
Evaluation on TiO Benchmark (10 volunteers). The following chart shows the disambiguation success rate and the results of human evaluation.
Real-world Evaluation. In this experiment, we used the Segment-Anything Model as the image segmenter and Contact-GraspNet as the grasp detector.
TiO is an interactive grounding model that can ask questions to disambiguate. It can directly locate the object being described and proactively asks questions to eliminate any ambiguity. Combined with segmentation and grasping models, it enables mobile robots to complete most manipulation and interactive tasks.
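The ask-or-ground behavior described above can be sketched as a simple dialog loop. This is a minimal illustration under stated assumptions, not the TiO inference code: `model_step` is a hypothetical callable standing in for the model, which returns either a question (a string) or a bounding box (a tuple), and `toy_model_step` is a hand-written stub used only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple, Union

Box = Tuple[int, int, int, int]
History = List[Tuple[str, str]]  # (question, answer) pairs


def run_dialog(instruction: str,
               model_step: Callable[[str, History], Union[str, Box]],
               ask_user: Callable[[str], str],
               max_turns: int = 3) -> Tuple[Optional[Box], History]:
    """Ask clarifying questions until the model commits to a box."""
    history: History = []
    for _ in range(max_turns):
        output = model_step(instruction, history)
        if not isinstance(output, str):
            return output, history                  # model grounded the target
        history.append((output, ask_user(output)))  # model asked a question
    return None, history                            # question budget exhausted


def toy_model_step(instruction: str, history: History) -> Union[str, Box]:
    """Hypothetical stand-in for the model: two cups, one question."""
    if not history:
        return "Do you mean the cup on the left or on the right?"
    answer = history[-1][1].lower()
    return (10, 20, 110, 140) if "left" in answer else (300, 20, 400, 140)


box, history = run_dialog("Hand me the cup.", toy_model_step,
                          ask_user=lambda question: "The left one.")
```

In a real deployment, the returned box would be passed on to the segmentation and grasp-detection stages; that hand-off is omitted here because its interface is not described in the source.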
@inproceedings{xu2023tio,
author = {Xu, Jie and Zhang, Hanbo and Si, Qingyi and Li, Yifeng and Lan, Xuguang and Kong, Tao},
title = {Towards Unified Interactive Visual Grounding in The Wild},
year = {2023}
}