Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench (ColBench)
Large language models (LLMs) are rapidly evolving into autonomous agents capable of handling complex tasks that demand reasoning, decision-making, and adaptability. These agents are already used for web navigation, personal assistance, and software development. To excel in real-world scenarios, they must navigate multi-turn interactions that span many steps and decision points, which requires training methods that optimize the entire trajectory of the interaction rather than each response in isolation.
The Challenge with Multi-Turn Decision-Making
LLM-based agents struggle with multi-turn decision-making, particularly with assigning proper credit to early actions whose consequences only become visible several turns later. Traditional training methods such as next-token prediction or imitation of high-probability actions do not account for these long-term dependencies or cumulative objectives, which makes them inefficient on long-horizon tasks, especially in collaborative settings.
Introducing SWEET-RL and CollaborativeAgentBench
Researchers from FAIR at Meta and UC Berkeley have introduced a novel reinforcement learning method named SWEET-RL (RL with Step-WisE Evaluation from Training-time information). Alongside SWEET-RL, they also present a benchmark called CollaborativeAgentBench (ColBench), which offers over 10,000 training tasks and more than 1,000 test cases across two domains: backend programming and frontend design.

ColBench simulates realistic collaboration between an AI agent and a human partner, in which the agent must ask questions, refine its understanding, and iteratively provide solutions. In the backend programming tasks, the agent writes Python functions and must seek clarifications to pin down an underspecified requirement; in the frontend tasks, it generates HTML that matches a visual target by incorporating feedback over successive turns, mirroring the constraints of real collaborative work.
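To make the interaction structure concrete, here is a minimal sketch of such a collaboration episode: the agent alternates between asking clarifying questions and proposing a solution, and a reward is only observed once a final answer is submitted. The data structures and function names (`Task`, `run_episode`, `agent_fn`, `human_fn`) are illustrative assumptions, not the actual ColBench API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative sketch only: these names and fields are assumptions,
# not the actual ColBench schema or API.

@dataclass
class Task:
    instruction: str                  # partial spec the agent sees
    hidden_spec: str                  # full spec known only to the human simulator
    evaluate: Callable[[str], float]  # e.g. test pass rate or visual similarity


def run_episode(agent_fn, human_fn, task: Task,
                max_turns: int = 10) -> Tuple[float, List[Dict[str, str]]]:
    """Alternate agent turns with simulated human feedback until a final answer."""
    history: List[Dict[str, str]] = [{"role": "system", "content": task.instruction}]
    for _ in range(max_turns):
        # The agent either asks a clarifying question or proposes a solution.
        text, is_final = agent_fn(history)
        history.append({"role": "assistant", "content": text})
        if is_final:
            # The outcome signal is only observed at the end of the interaction.
            return task.evaluate(text), history
        # The simulated human answers using the hidden specification.
        history.append({"role": "user", "content": human_fn(task.hidden_spec, history)})
    return 0.0, history  # ran out of turns without a final solution
```

The key property this sketch highlights is that the outcome signal arrives only after the full interaction, which is exactly what makes per-turn credit assignment difficult.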
The Success of SWEET-RL
SWEET-RL adopts an asymmetric actor-critic structure: the critic has access to additional information that is available only during training, and can therefore evaluate each decision the agent makes more accurately. From this critic, a per-turn advantage function is derived, which simplifies credit assignment and steers the agent toward behaviors aligned with what the human partner actually wants.
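The sketch below illustrates this general idea under stated assumptions: a critic that receives privileged training-time information (for example, the hidden specification or a reference solution) scores candidate actions at each turn, the per-turn advantage is each score relative to a baseline, and the policy is nudged toward the higher-advantage action with a simple preference-style loss. The function names, the privileged inputs, and the exact losses are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Conceptual sketch, not SWEET-RL's actual training code.
# Assumptions: `critic(history, action, privileged_info)` returns a scalar score,
# and `policy_logprob(history, action)` returns the policy's log-probability.

def per_turn_advantages(critic, history, candidate_actions, privileged_info):
    """Score each candidate action for the current turn with the privileged critic;
    the advantage is each score relative to the turn's average (a simple baseline)."""
    scores = torch.stack([critic(history, a, privileged_info) for a in candidate_actions])
    return scores - scores.mean()


def preference_loss(policy_logprob, history, better, worse, beta: float = 0.1):
    """Push the policy toward the action the critic judged to have the higher
    advantage at this turn (one simple way to consume per-turn advantages)."""
    margin = policy_logprob(history, better) - policy_logprob(history, worse)
    return -F.logsigmoid(beta * margin)
```

Because the critic is only needed at training time, it can freely condition on privileged information without constraining what the deployed agent sees, which is the essence of the asymmetric design.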
Key Findings and Takeaways
SWEET-RL delivered a 6% absolute improvement over other multi-turn reinforcement learning methods on both the backend programming and frontend design tasks, reflecting more effective credit assignment as well as better generalization and scalability.

This research emphasizes the importance of precise, turn-by-turn feedback for training interactive agents effectively. By leveraging training-time information and an architecture-aligned optimization approach, SWEET-RL offers promising results for developing agents capable of reasoning, adapting, and collaborating over extended interactions.
Conclusion
The combination of SWEET-RL and the ColBench benchmark presents a robust foundation for advancing agent capabilities in real-world scenarios. This research opens up new possibilities for enhancing interactive agents' performance and adaptability across diverse tasks.
For further details, refer to the Paper, GitHub Page, and Dataset.