LLM & Human-Machine Collaboration Reading Party: Cycle 0

Online, 2023

Introduction

Before the summer holiday, we successfully organized a month-long reading party focused on LLMs and robotics (see Cycle 0: LLM and Robotics below). During that month, we explored various applications of LLMs in the field of robotics. Starting in September 2023, we hope to revive the reading party. Our theme will continue to revolve around LLMs and human-machine collaboration. To make our discussions more in-depth and specific, we will set a particular research direction each month. Each direction will be hosted by a PhD student working in that area, whose primary task will be to recommend relevant, outstanding papers. Participating students will then delve into these papers, sharing insights and engaging in discussion. Note: it is acceptable if a paper does not use an LLM, as long as it falls under the topic and has potential for LLM-based improvement.

The reading party is scheduled to be held every two weeks, with two students presenting each time. Each student will have 30 minutes to introduce and interpret the paper(s), followed by a 15-minute discussion and Q&A session. Given that some papers in the LLM field are shorter or more straightforward, some students may present multiple papers within their allocated time slot. In total, 90 minutes are set aside for each session, ensuring that every presentation receives ample time and attention. Furthermore, the specific research topic or direction changes each month, so that students from various research backgrounds can benefit and gain insights from the sessions.

Candidate topics:

  • LLM for Reasoning
  • LLM for Game-Theoretic Problems
  • LLM for Planning
  • LLM for Embodied AI
  • LLM for HCI
  • LLM for Science
  • LLM for Synthetic Data Generation
  • TBD

How to join?

Contacts: Yang Li

Email: yang.li-4@manchester.ac.uk

Homepage: liyang.page

Cycle 0: LLM and Robotics

Sharing 1: By Yang Li

Speaker: Yang Li, PhD Student at the University of Manchester

Contents:

  • “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy
    • Paper Link
    • Summary (by ChatGPT): This paper presents a framework for improving human-robot collaboration by enabling robots to understand and respond to real-time linguistic guidance during manipulation tasks. The authors demonstrate that incorporating online language corrections, such as “No, to the right” or “a bit higher,” can significantly enhance the performance and efficiency of the tasks, leading to more effective human-robot collaboration in various scenarios.
  • Large Language Models as Zero-Shot Human Models for Human-Robot Interaction
    • Paper Link
    • Summary (by ChatGPT): Human models are crucial for human-robot interaction (HRI), but creating accurate models is challenging due to the need for prior knowledge or extensive interaction data. This study explores the potential of large language models (LLMs) as zero-shot human models for HRI, with experiments demonstrating promising results and comparable performance to purpose-built models. Despite some limitations, integrating LLM-based human models into a social robot’s planning process can improve HRI scenarios, as shown through a trust-based table-clearing task and a robot utensil-passing experiment.

Slides Link.

Sharing 2: By Weiqin Zu

Speaker: Weiqin Zu, Master's Student at ShanghaiTech University

Contents:

  • Chat with the Environment: Interactive Multimodal Perception using Large Language Models
    • Paper Link
    • Summary (by ChatGPT): Programming robot behavior in complex environments involves challenges ranging from dexterous low-level skills to high-level planning and reasoning. This study explores the use of pre-trained Large Language Models (LLMs) for zero-shot robotic planning and develops an interactive perception framework with an LLM as its backbone to instruct epistemic actions and reason over multimodal sensations. The findings demonstrate that LLMs can effectively control interactive robot behavior in a multimodal environment when grounded by multimodal modules that provide environmental context.
  • Code as Policies: Language Model Programs for Embodied Control
    • Paper Link
    • Summary (by ChatGPT): LLMs trained on code completion can be repurposed to write robot policy code based on natural language commands. By using few-shot prompting with example language commands and corresponding policy code, LLMs can autonomously generate new policy code that exhibits spatial-geometric reasoning, generalizes to new instructions, and prescribes precise values. This paper presents “code as policies,” a robot-centric formulation of LMPs, demonstrating reactive and waypoint-based policies across multiple real robot platforms. The approach also improves state-of-the-art performance on the HumanEval benchmark.
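
To make the few-shot prompting idea in the summary above more concrete, here is a minimal sketch of how a language command might be turned into policy code. The prompt format and the pick_place, move_gripper, and llm_complete names are illustrative assumptions, not the paper's actual API.

    # Minimal sketch of few-shot "code as policies" prompting (illustrative only).
    # `llm_complete`, `pick_place`, and `move_gripper` are assumed helpers,
    # not the API from the paper.

    FEW_SHOT_PROMPT = """
    # Command: put the red block on the blue block
    pick_place(obj="red block", target="blue block")

    # Command: move the gripper 5 cm to the left
    move_gripper(dx=-0.05, dy=0.0, dz=0.0)

    # Command: {command}
    """

    def command_to_policy_code(command: str, llm_complete) -> str:
        """Ask the LLM to continue the prompt with policy code for `command`."""
        prompt = FEW_SHOT_PROMPT.format(command=command)
        return llm_complete(prompt, stop=["# Command:"])

    # The returned string would then be executed against the robot's skill library,
    # e.g. exec(code, {"pick_place": pick_place, "move_gripper": move_gripper}).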

Slides Link.

Sharing 3: By Yang Li

Speaker: Yang Li, PhD Student at the University of Manchester

Contents:

  • Guiding Pretraining in Reinforcement Learning with Large Language Models
    • Paper Link
    • Summary (by ChatGPT): The ELLM method leverages large-scale language models to guide reinforcement learning agents toward human-meaningful behaviors based on their current state. Evaluations in the Crafter game environment and Housekeep robotic simulator demonstrate improved common-sense behavior coverage and performance on downstream tasks.
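
As an illustration of how such LLM suggestions could shape a pretraining reward, the sketch below grants an intrinsic reward when a transition achieves a goal the LLM proposes for the captioned state. The helper names here are assumptions for illustration, not the ELLM codebase.

    # Rough sketch of LLM-guided exploration rewards (illustrative, not ELLM's code).

    def intrinsic_reward(state, transition, suggest_goals, caption, achieved) -> float:
        """Reward the agent when a transition achieves an LLM-suggested goal.

        suggest_goals(text) -> list[str]    # LLM proposes plausible goals for the state
        caption(state) -> str               # textual description of the state
        achieved(goal, transition) -> bool  # checks goal completion (e.g. by similarity)
        """
        goals = suggest_goals(caption(state))
        return 1.0 if any(achieved(g, transition) for g in goals) else 0.0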

Slides Link.

Sharing 4: By Shao Zhang

Speaker: Shao Zhang, PhD Student at Shanghai Jiao Tong University

Contents:

  • Text2Motion: From Natural Language Instructions to Feasible Plans
    • Paper Link
    • Summary (by ChatGPT): Text2Motion is a language-based planning framework that enables robots to perform sequential manipulation tasks with long-horizon reasoning, using natural language instructions. By considering geometric dependencies across skill sequences and optimizing policy sequences, Text2Motion significantly outperforms previous language-based planning methods, achieving a 64% success rate on challenging problems. The framework shows promising generalization capabilities for semantically diverse tasks with dependencies between skills.

Slides Link.

Sharing 5: By Wenhao Zhang

Speaker: Wenhao Zhang, Undergraduate Student at Shanghai Jiao Tong University

Contents:

  • Pre-Trained Language Models for Interactive Decision-Making
    • Paper Link
    • Summary (by ChatGPT): The study explores using pre-trained language models for general sequential decision-making problems, predicting actions through a policy network. Results show improved task completion rates and the importance of sequential input representations, suggesting LMs can aid learning and generalization beyond language processing tasks.
  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
    • Paper Link
    • Summary (by ChatGPT): Large language models can offer valuable semantic knowledge for robots executing high-level, natural language instructions, but lack real-world experience. By integrating pretrained skills and value functions, this approach grounds language models to specific physical environments, enabling robots to successfully perform complex, long-horizon tasks based on abstract instructions.
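
As a rough picture of the grounding idea above, the sketch below scores each candidate skill by the product of an LLM relevance score and a value-function estimate of success, then picks the best one. The llm_score and value_fn callables are assumed placeholders rather than the paper's implementation.

    # Illustrative sketch of combining LLM relevance with affordance values (SayCan-style).
    # `llm_score` and `value_fn` are assumed callables, not the paper's API.

    def select_skill(instruction, history, skills, state, llm_score, value_fn):
        """Pick the skill with the highest (LLM relevance) x (affordance value).

        llm_score(instruction, history, skill) -> float  # how useful the LLM thinks the skill is
        value_fn(state, skill) -> float                  # how likely the skill succeeds here
        """
        scored = {
            skill: llm_score(instruction, history, skill) * value_fn(state, skill)
            for skill in skills
        }
        return max(scored, key=scored.get)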

Slides Link.

Sharing 6: By Zhe Cao

Speaker: Zhe Cao, Undergraduate Student at Shanghai Jiao Tong University

Contents:

  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
    • Paper Link
    • Summary (by ChatGPT): Large language models can be used to act in interactive environments by decomposing high-level tasks into mid-level plans when prompted correctly. However, their generated plans may not map directly to admissible actions, requiring conditioning on existing demonstrations and semantic translation to improve executability. Evaluations show promising results in extracting actionable knowledge from language models, despite a trade-off between executability and correctness.
  • Inner Monologue: Embodied Reasoning through Planning with Language Models
    • Paper Link
    • Summary (by ChatGPT): Large language models can be applied to robotic planning and interaction by leveraging environment feedback and forming an inner monologue for richer processing. By utilizing various feedback sources like success detection, scene description, and human interaction, closed-loop language feedback significantly improves high-level instruction completion in both simulated and real-world tasks, including long-horizon mobile manipulation tasks.
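
A minimal sketch of such a closed feedback loop, assuming hypothetical llm_plan_step, execute, and describe_scene helpers (not the paper's interface), might look as follows.

    # Illustrative closed-loop "inner monologue" sketch (assumed helpers, not the paper's API).

    def inner_monologue_episode(instruction, llm_plan_step, execute, describe_scene, max_steps=10):
        """Alternate LLM planning with environment feedback appended to the dialogue.

        llm_plan_step(dialogue) -> str  # next action proposed by the LLM, or "done"
        execute(action) -> (bool, str)  # success flag and textual feedback
        describe_scene() -> str         # scene description fed back to the LLM
        """
        dialogue = [f"Task: {instruction}", f"Scene: {describe_scene()}"]
        for _ in range(max_steps):
            action = llm_plan_step(dialogue)
            if action == "done":
                break
            success, feedback = execute(action)
            dialogue += [
                f"Robot: {action}",
                f"Success: {success}",
                f"Feedback: {feedback}",
                f"Scene: {describe_scene()}",
            ]
        return dialogue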

Slides Link.

Sharing 7: By Yudi Zhang

Speaker: Yudi Zhang, PhD Student at Eindhoven University of Technology

Contents:

  • LLM as A Robotic Brain: Unifying Egocentric Memory and Control
    • Paper Link
    • Summary (by ChatGPT): The LLM-Brain framework uses large-scale language models as a robotic brain to unify egocentric memory and control in embodied AI systems. By integrating multiple multimodal language models, LLM-Brain allows for closed-loop multi-round dialogues that encompass perception, planning, control, and memory. The framework is demonstrated through active exploration and embodied question answering tasks, showcasing its versatility and potential in various robotic applications.
  • Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control
    • Paper Link
    • Summary (by ChatGPT): Combining the semantic knowledge of large language models (LLMs) with grounded models of the environment, a guided decoding strategy is proposed for solving complex, long-horizon tasks in embodied robotic settings. This approach addresses the challenges of applying LLMs to embodied agents by decoding action sequences that are both likely under the language model and realizable under grounded model objectives. The successful integration of LLMs and grounded models demonstrates the potential for more sophisticated language-conditioned robotic policies.
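
One way to read the decoding rule described above is as a token-level product of the language model's probability and a grounding score. The greedy sketch below illustrates that reading under simplifying assumptions; lm_next_token_probs and grounded_score are hypothetical functions, not the paper's interface.

    # Simplified greedy sketch of grounded decoding (illustrative assumptions throughout).

    def grounded_decode(prefix, vocab, lm_next_token_probs, grounded_score, max_len=20, eos="<eos>"):
        """Greedily pick tokens that are likely under the LM *and* the grounded models.

        lm_next_token_probs(tokens) -> dict[token, float]  # LM distribution over next token
        grounded_score(tokens, token) -> float             # e.g. affordance / safety score
        """
        tokens = list(prefix)
        for _ in range(max_len):
            probs = lm_next_token_probs(tokens)
            best = max(vocab, key=lambda t: probs.get(t, 0.0) * grounded_score(tokens, t))
            tokens.append(best)
            if best == eos:
                break
        return tokens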

Slides Link.