LLM & Human-Machine Collaboration Reading Party: Cycle 0
Online, 2023
Introduction
Before the summer holiday, we successfully organized a month-long reading party focused on LLMs and robotics (see Cycle 0: LLM and Robotics). During that month, we explored various applications of LLMs in the field of robotics. Starting in September 2023, we hope to revive the reading party. Our theme will continue to revolve around LLMs and human-machine collaboration. To make our discussions more in-depth and specific, we will set a particular research direction each month. Each direction will be hosted by a PhD student working in that area, whose primary task will be to recommend and propose relevant outstanding papers. Participating students will then delve into these papers, sharing insights and engaging in discussion. Note: it is acceptable if a paper does not use an LLM, as long as it falls under the topic and has potential for LLM-based improvements.
The reading party is scheduled to be held every two weeks, with two students presenting each session. Each student will have 30 minutes to introduce and interpret the paper(s), followed by a 15-minute discussion and Q&A session. Given that some papers in the LLM field are shorter or more straightforward, some students may need to present multiple papers within their allocated time slot. We have therefore set aside a total of 90 minutes for each session, ensuring that every presentation receives ample time and attention. Furthermore, each month we will switch to a different research topic or direction, so that students from various research backgrounds can benefit and gain insights from the sessions.
The candidate topics are:
- LLM for Reasoning
- LLM for Game-Theoretic Problems
- LLM for Planning
- LLM for Embodied AI
- LLM for HCI
- LLM for Science
- LLM for Synthetic Data Generation
- TBD
How to join?
Contacts: Yang Li
Email: yang.li-4@manchester.ac.uk
Homepage: liyang.page
Cycle 0: LLM and Robotics
Sharing 1: By Yang Li
Speaker: Yang Li, PhD student at the University of Manchester
Contents:
- “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy
- Paper Link
- Summary (by ChatGPT): This paper presents a framework for improving human-robot collaboration by enabling robots to understand and respond to real-time linguistic guidance during manipulation tasks. The authors demonstrate that incorporating online language corrections, such as “No, to the right” or “a bit higher,” can significantly enhance the performance and efficiency of the tasks, leading to more effective human-robot collaboration in various scenarios.
- Large Language Models as Zero-Shot Human Models for Human-Robot Interaction
- Paper Link
- Summary (by ChatGPT): Human models are crucial for human-robot interaction (HRI), but creating accurate models is challenging due to the need for prior knowledge or extensive interaction data. This study explores the potential of large language models (LLMs) as zero-shot human models for HRI, with experiments demonstrating promising results and comparable performance to purpose-built models. Despite some limitations, integrating LLM-based human models into a social robot’s planning process can improve HRI scenarios, as shown through a trust-based table-clearing task and a robot utensil-passing experiment.
Sharing 2: By Weiqin Zu
Speaker: Weiqin Zu, Master's student at ShanghaiTech University
Contents:
- Chat with the Environment: Interactive Multimodal Perception using Large Language Models
- Paper Link
- Summary (by ChatGPT): Programming robot behavior in complex environments involves challenges ranging from dexterous low-level skills to high-level planning and reasoning. This study explores the use of pre-trained Large Language Models (LLMs) for zero-shot robotic planning and develops an interactive perception framework with an LLM as its backbone to instruct epistemic actions and reason over multimodal sensations. The findings demonstrate that LLMs can effectively control interactive robot behavior in a multimodal environment when grounded by multimodal modules that provide environmental context.
- Code as Policies: Language Model Programs for Embodied Control
- Paper Link
- Summary (by ChatGPT): LLMs trained on code completion can be repurposed to write robot policy code based on natural language commands. By using few-shot prompting with example language commands and corresponding policy code, LLMs can autonomously generate new policy code that exhibits spatial-geometric reasoning, generalizes to new instructions, and prescribes precise values. This paper presents “code as policies,” a robot-centric formulation of LMPs, demonstrating reactive and waypoint-based policies across multiple real robot platforms. The approach also improves state-of-the-art performance on the HumanEval benchmark.
Sharing 3: By Yang Li
Speaker: Yang Li, PhD student at the University of Manchester
Contents:
- Guiding Pretraining in Reinforcement Learning with Large Language Models
- Paper Link
- Summary (by ChatGPT): The ELLM method leverages large-scale language models to guide reinforcement learning agents toward human-meaningful behaviors based on their current state. Evaluations in the Crafter game environment and Housekeep robotic simulator demonstrate improved common-sense behavior coverage and performance on downstream tasks.
Sharing 4: By Shao Zhang
Speaker: Shao Zhang, PhD student at Shanghai Jiao Tong University
Contents:
- Text2Motion: From Natural Language Instructions to Feasible Plans
- Paper Link
- Summary (by ChatGPT): Text2Motion is a language-based planning framework that enables robots to perform sequential manipulation tasks with long-horizon reasoning, using natural language instructions. By considering geometric dependencies across skill sequences and optimizing policy sequences, Text2Motion significantly outperforms previous language-based planning methods, achieving a 64% success rate on challenging problems. The framework shows promising generalization capabilities for semantically diverse tasks with dependencies between skills.
Sharing 5: By Wenhao Zhang
Speaker: Wenhao Zhang, undergraduate student at Shanghai Jiao Tong University
Contents:
- Pre-Trained Language Models for Interactive Decision-Making
- Paper Link
- Summary (by ChatGPT): The study explores using pre-trained language models for general sequential decision-making problems, predicting actions through a policy network. Results show improved task completion rates and the importance of sequential input representations, suggesting LMs can aid learning and generalization beyond language processing tasks.
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- Paper Link
- Summary (by ChatGPT): Large language models can offer valuable semantic knowledge for robots executing high-level, natural language instructions, but lack real-world experience. By integrating pretrained skills and value functions, this approach grounds language models to specific physical environments, enabling robots to successfully perform complex, long-horizon tasks based on abstract instructions.
Sharing 6: By Zhe Cao
Speaker: Zhe Cao, undergraduate student at Shanghai Jiao Tong University
Contents:
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
- Paper Link
- Summary (by ChatGPT): Large language models can be used to act in interactive environments by decomposing high-level tasks into mid-level plans when prompted correctly. However, their generated plans may not map directly to admissible actions, requiring conditioning on existing demonstrations and semantic translation to improve executability. Evaluations show promising results in extracting actionable knowledge from language models, despite a trade-off between executability and correctness.
- Inner Monologue: Embodied Reasoning through Planning with Language Models
- Paper Link
- Summary (by ChatGPT): Large language models can be applied to robotic planning and interaction by leveraging environment feedback and forming an inner monologue for richer processing. By utilizing various feedback sources like success detection, scene description, and human interaction, closed-loop language feedback significantly improves high-level instruction completion in both simulated and real-world tasks, including long-horizon mobile manipulation tasks.
Sharing 7: By Yudi Zhang
Speaker: Yudi Zhang, PhD student at Eindhoven University of Technology
Contents:
- LLM as A Robotic Brain: Unifying Egocentric Memory and Control
- Paper Link
- Summary (by ChatGPT): The LLM-Brain framework uses large-scale language models as a robotic brain to unify egocentric memory and control in embodied AI systems. By integrating multiple multimodal language models, LLM-Brain allows for closed-loop multi-round dialogues that encompass perception, planning, control, and memory. The framework is demonstrated through active exploration and embodied question answering tasks, showcasing its versatility and potential in various robotic applications.
- Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control
- Paper Link
- Summary (by ChatGPT): Combining the semantic knowledge of large language models (LLMs) with grounded models of the environment, a guided decoding strategy is proposed for solving complex, long-horizon embodiment tasks in robotic settings. This approach addresses the challenges of applying LLMs to embodied agents by decoding action sequences that are both likely under the language model and realizable under grounded model objectives. The successful integration of LLMs and grounded models demonstrates the potential for more sophisticated language-conditioned robotic policies.