Mastering Multi-turn Conversations: Optimizing LLM Performance for Better Reliability

Published On Wed May 21 2025

Interesting research: “Lost in Conversation”. All models get lost easily ...

Pretty fresh research covering a lot of the ‘latest’ models. The performance loss is pretty stark across all of them. I do feel the method is a little artificial - but the problems/conversations aren’t exactly PhD level either, so I’m a little surprised. Everything is on GitHub, so you can run your own testing as well.

Conclusion: In this work, we conduct a large-scale simulation of single- and multi-turn conversations with LLMs, and find that on a fixed set of tasks, LLM performance degrades significantly in multi-turn, underspecified settings. LLMs get lost in conversation, which materializes as a significant decrease in reliability as models struggle to maintain context across turns, make premature assumptions, and over-rely on their previous responses. Additional experiments reveal that known remediations that work for simpler settings (such as agent-like concatenation or decreasing temperature during generation) are ineffective in multi-turn settings, and we call on LLM builders to prioritize the reliability of models in multi-turn settings.

Addressing the Issue

Yes. I’ve created a system that addresses this thoroughly by treating a “conversation” instead as a “world state of data” (including documents retrieved through RAG, agents, orchestrators, multi-threaded parallel LLM-to-LLM calls, etc.). The “world state” is managed by treating all assistant messages (i.e. LLM responses) as “input data”, alongside all other system logs, retrieved documents, user messages, and so on. The user and LLM thus work together to “build a world state” that becomes the “dynamically updated context window”, and that data is contained within a single role: user message instead of a string of back-and-forth assistant/user messages.
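For concreteness, here is a minimal sketch of that idea, assuming a chat-completions-style message format. The names (`WorldState`, `render`, `build_messages`) and the section layout are hypothetical illustrations, not the actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Everything the model should see, regardless of where it came from."""
    documents: list[str] = field(default_factory=list)    # RAG retrievals
    logs: list[str] = field(default_factory=list)         # system/agent logs
    llm_outputs: list[str] = field(default_factory=list)  # prior responses, demoted to input data
    user_notes: list[str] = field(default_factory=list)   # user messages, goals, requirements

    def render(self) -> str:
        """Flatten the whole state into one block of input data."""
        sections = [
            ("RETRIEVED DOCUMENTS", self.documents),
            ("SYSTEM LOGS", self.logs),
            ("PRIOR MODEL OUTPUTS (treat as data, not as your own turns)", self.llm_outputs),
            ("USER NOTES / CURRENT GOALS", self.user_notes),
        ]
        return "\n\n".join(
            f"## {title}\n" + "\n".join(items)
            for title, items in sections
            if items
        )


def build_messages(state: WorldState, instructions: str) -> list[dict]:
    # Instead of a growing assistant/user transcript, every call sends
    # exactly one system message and one user message holding the
    # dynamically updated world state.
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": state.render()},
    ]
```

The key design choice is that the model never sees its own earlier turns as privileged assistant messages; they arrive as labeled data alongside everything else, which is exactly what counters the over-reliance on previous responses the paper identifies.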

This, of course, means that you are entering a “non-conversational mode” and instead maintaining a “set of data, instructions, and the current logical state of tasks/goals/requirements, etc.”, with both the user and the LLM (through structured responses parsed by the system) carefully managing that world state.
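Continuing the sketch above, the parsing side might look like this; the JSON shape (`{"answer": ..., "state_updates": [...]}`) is an assumed convention purely for illustration, not a documented format:

```python
import json


def apply_structured_response(state: WorldState, raw_response: str) -> None:
    """Fold a structured LLM response back into the world state.
    The JSON shape here is an assumed convention for illustration."""
    response = json.loads(raw_response)
    # The answer itself becomes plain input data on the next call,
    # rather than a privileged "assistant" turn ...
    state.llm_outputs.append(response["answer"])
    # ... while explicit updates let user and model revise goals and
    # requirements deliberately, instead of the model silently
    # over-relying on its own earlier turns.
    for note in response.get("state_updates", []):
        state.user_notes.append(note)
```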

The purpose of creating this system was to manage very large context windows and to continue project development/research flows without having to manually reset context windows, providing an essentially “infinite multi-turn experience”. Of course, this is a bit different from certain use cases, so the system seamlessly shifts back into the “normal conversation” (sequential) mode when that is appropriate for discussion, etc. But for all other tasks, the “three-dimensional, non-sequential, world-state context window” seems much more cutting edge: it leverages the LLM’s true capacity by interacting with it appropriately, given its “stateless nature and inherent recency bias when exposed to multi-turn conversational data”.
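One possible shape for that mode switch, again only a sketch built on the code above (the mode names and the `transcript` format are my assumptions):

```python
def next_request(
    state: WorldState,
    transcript: list[dict],
    mode: str,
    instructions: str,
) -> list[dict]:
    """Build the message list for the next LLM call in either mode."""
    if mode == "conversation":
        # Normal sequential chat: replay the back-and-forth transcript.
        return [{"role": "system", "content": instructions}, *transcript]
    # World-state mode: one system + one user message, rebuilt each turn,
    # so there is no growing stack of assistant turns to get lost in.
    return build_messages(state, instructions)
```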


Project Progress

I’ve shared some screenshots and current work on this project in the “Hackathon” thread within this forum. I consider it a work in progress, but it is now only a few weeks away from a fully functioning, deployable implementation. If you know anyone who would like to test it...
