(28 May 2025) A new study from Microsoft and Salesforce finds that even state-of-the-art AI language models become dramatically less reliable as conversations get longer and users reveal their requirements step by step. On average, the systems’ performance dropped by 39 percent in these scenarios.
To simulate how people actually interact with AI assistants, the researchers created a method called “sharding.” Instead of giving the model a fully specified task up front, they broke each task into smaller pieces (“shards”) and revealed them one turn at a time—mirroring the way users typically spell out what they want over the course of a conversation.
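To make the setup concrete, here is a minimal sketch of what sharded prompting looks like in practice. It assumes the OpenAI Python client and GPT-4o purely for illustration, and the shard texts are invented examples rather than items from the study’s benchmark.

```python
# Minimal sketch of "sharded" prompting: instead of sending one fully
# specified instruction, the task is split into shards that are revealed
# one conversational turn at a time, with the model answering after each.
# Illustrative only; shards are not taken from the study's benchmark.
from openai import OpenAI

client = OpenAI()

# A full task, broken into the pieces a user might reveal gradually.
shards = [
    "I need a Python function that filters a list of orders.",
    "Each order is a dict with 'total' and 'country' keys.",
    "Only keep orders with a total over 100.",
    "Actually, restrict it to orders from Germany as well.",
    "Return the results sorted by total, highest first.",
]

messages = []
for shard in shards:
    messages.append({"role": "user", "content": shard})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

print(answer)  # the model's final attempt after all shards have been revealed
```

In the single-turn baseline, the same information would instead be concatenated into one complete prompt and sent in a single request, which is the setting where the models scored highest.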
When the AI models had to handle these step-by-step instructions, their accuracy dropped from about 90 percent to just 51 percent. This steep decline was seen across all 15 models tested, from smaller open-source models like Llama-3.1-8B to big commercial systems like GPT-4o.
Even the top-tier models in the study—Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT-4.1—performed 30 to 40 percent worse in multi-turn conversations than with single-turn prompts. The models also became far less consistent, with results varying much more than in one-shot interactions.
The researchers identified four recurring failure modes: models jump to a solution before all the necessary details have been revealed, lean too heavily on their own earlier (and sometimes incorrect) answers, neglect information from the middle of the conversation, and produce overly long responses that fill gaps in the specification with their own assumptions.