While advanced robots can follow natural-language instructions to manipulate everyday items, these systems still struggle to identify and grasp specific parts of objects, such as a teakettle's handle, a capability that is often necessary for completing the daily tasks they're supposed to assist with.
“For a robot to be able to pick things up like a human would, it needs to reason about what object parts are relevant, ground the parts to its 3D visual perception, and plan for a sequence of part-specific manipulations,” explains Tianmin Shu, an assistant professor of computer science and cognitive science at the Johns Hopkins University.
To enable this process, Shu’s research team has built the first large-scale dataset for training and evaluating language-guided, fine-grained robot manipulation models. The work includes PartInstruct, the first part-level instruction-following benchmark, as well as a large training set comprising hundreds of annotated synthetic 3D objects and over a thousand part-level manipulation tasks. The team presented its research at this year’s Robotics: Science and Systems conference, held in Los Angeles June 21–25.
The researchers also generated over ten thousand expert demonstrations in a new robotic simulator called PartGym, which they used to perform a comprehensive evaluation of state-of-the-art robot manipulation approaches such as end-to-end vision-language policy learning and bi-level planning.
“While all the baselines we evaluated still struggled with fine-grained manipulation tasks—especially with generalizing to new objects or instructions—we found a few important factors for improving the robustness of manipulation performance,” Shu notes.
One discovery was that “bi-level planning models” significantly improved how well a robot performed fine-grained tasks. Using two modules—a high-level task planner, which creates skill commands, and a low-level action policy, which generates robot motion plans to complete the commands—these models break down complex tasks into simpler steps that each focus on a single object part.
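To make that division of labor concrete, here is a minimal Python sketch of the bi-level idea. The class names, skill vocabulary, and hard-coded teakettle plan are illustrative assumptions, not the team's actual models, which would use learned components at both levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillCommand:
    """One part-level step produced by the high-level planner."""
    skill: str        # e.g., "grasp", "lift", "place"
    object_name: str  # e.g., "teakettle"
    part_name: str    # e.g., "handle"


class HighLevelPlanner:
    """Decomposes a natural-language instruction into part-level skill commands.

    A real planner would query a learned (e.g., vision-language) model; this
    stub returns a hand-written plan for a single illustrative instruction.
    """

    def plan(self, instruction: str) -> List[SkillCommand]:
        if "teakettle" in instruction:
            return [
                SkillCommand("grasp", "teakettle", "handle"),
                SkillCommand("lift", "teakettle", "handle"),
                SkillCommand("place", "teakettle", "base"),
            ]
        return []


class LowLevelPolicy:
    """Turns one skill command into a short sequence of motor actions.

    A trained policy would condition on camera observations and output
    continuous end-effector commands; this stub just names the steps.
    """

    def execute(self, command: SkillCommand) -> List[str]:
        return [
            f"move gripper toward the {command.part_name} of the {command.object_name}",
            f"run the '{command.skill}' primitive on the {command.part_name}",
        ]


def run_bilevel(instruction: str) -> None:
    """Chain the two modules: plan part-level steps, then execute each one."""
    planner, policy = HighLevelPlanner(), LowLevelPolicy()
    for command in planner.plan(instruction):
        for action in policy.execute(command):
            print(action)


if __name__ == "__main__":
    run_bilevel("Pick up the teakettle by its handle and set it on the stove.")
```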
The team also found that better visual representations, such as explicit 3D representations and part segmentations, can provide a significant performance boost, highlighting the importance of part-level visual understanding and reasoning in 3D space.
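As a rough illustration of what such part-aware visual input can look like in practice, the hypothetical sketch below bundles a camera image with per-pixel 3D coordinates and a part segmentation mask so that a policy could isolate the 3D points of the part named in the instruction. The function and field names are assumptions for illustration, not the benchmark's actual interface.

```python
import numpy as np


def build_observation(rgb: np.ndarray,
                      point_cloud: np.ndarray,
                      part_mask: np.ndarray,
                      target_part_id: int) -> dict:
    """Bundle an RGB frame with explicit 3D and part-level cues.

    rgb:          (H, W, 3) camera image
    point_cloud:  (H, W, 3) per-pixel 3D coordinates in the camera frame
    part_mask:    (H, W) integer labels, one ID per object part
    """
    # Binary mask selecting only the pixels of the part named in the instruction.
    target_mask = (part_mask == target_part_id)

    # 3D points belonging to that part, which a policy could use to aim the gripper.
    part_points = point_cloud[target_mask]

    return {
        "rgb": rgb,
        "point_cloud": point_cloud,
        "target_part_mask": target_mask,
        "target_part_points": part_points,
        "target_part_centroid": part_points.mean(axis=0) if len(part_points) else None,
    }


if __name__ == "__main__":
    h, w = 4, 4  # tiny synthetic example
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    cloud = np.random.rand(h, w, 3)
    mask = np.zeros((h, w), dtype=int)
    mask[1:3, 1:3] = 7  # pretend part ID 7 is the handle
    obs = build_observation(rgb, cloud, mask, target_part_id=7)
    print(obs["target_part_centroid"])
```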
To continue research in this area, the researchers plan to scale up their dataset by adding more synthetic 3D object assets, incorporating a broader range of tasks, and collecting human demonstrations via virtual reality, all with the end goal of evaluating trained models on real-world robots.
“Developing this kind of fine-grained object manipulation capability will help developers engineer robots that are more useful in real-world scenarios like factories and warehouses, where they have to interact with many different objects and their parts in rich, complex ways,” Shu says. “Additionally, robots that understand natural language instruction will also better understand their human collaborators—for instance, when assembling furniture with a human, a robot must understand which part its human partner intends to install and how they plan to install it.”
With a diverse set of objects, tasks, and expert demonstrations, PartInstruct provides a foundation for training robots to handle objects with humanlike precision, the researchers say.
“Our work also highlights the gaps in current robot foundation models and provides new tools and resources for future research in this direction,” says Shu.
Additional authors of this work include CS PhD students Yifan Yin, Jianxin Wang, and Jiawei Peng; Bloomberg Distinguished Professor of Computational Cognitive Science Alan Yuille; Shivam Aarya, an undergraduate student studying computer science and applied mathematics and statistics; visiting undergraduate student Zhengtao Han; and alumni Angtian Wang, Engr ’24 (PhD), and Shuhang Xu, Engr ’25 (MSE).