Research

Which Agent Causes Task Failures and When? Researchers from PSU and Duke Explore Automated Failure Attribution in LLM Multi-Agent Systems


Share My Research is Synced’s column that welcomes scholars to share their own research breakthroughs with over 1.5M global AI enthusiasts. Beyond technological advances, Share My Research also calls for interesting stories behind the research and exciting research ideas. Contact us: chain.zhang@jiqizhixin.com

Meet the authors
Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang of Penn State University and Ming Yin of Duke University.

In recent years, LLM Multi-Agent systems have garnered widespread attention for their collaborative approach to solving complex problems. However, it’s a common scenario for these systems to fail at a task despite a flurry of activity. This leaves developers with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to pinpoint the root cause feels like finding a needle in a haystack—a time-consuming and labor-intensive effort.
 
This is a familiar frustration for developers. In increasingly complex Multi-Agent systems, failures are not only common but also incredibly difficult to diagnose due to the autonomous nature of agent collaboration and long information chains. Without a way to quickly identify the source of a failure, system iteration and optimization grind to a halt.
 
To address this challenge, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the novel research problem of “Automated Failure Attribution.” They have constructed the first benchmark dataset for this task, Who&When, and have developed and evaluated several automated attribution methods. This work not only highlights the complexity of the task but also paves a new path toward enhancing the reliability of LLM Multi-Agent systems.


The paper has been accepted as a Spotlight presentation at the top-tier machine learning conference, ICML 2025, and the code and dataset are now fully open-source.

Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When
 
 
Research Background and Challenges
LLM-driven Multi-Agent systems have demonstrated immense potential across many domains. However, these systems are fragile; errors by a single agent, misunderstandings between agents, or mistakes in information transmission can lead to the failure of the entire task.

Currently, when a system fails, developers are often left with manual and inefficient methods for debugging:
– Manual Log Archaeology: Developers must manually review lengthy interaction logs to find the source of the problem.
– Reliance on Expertise: The debugging process is highly dependent on the developer’s deep understanding of the system and the task at hand.
 
This “needle in a haystack” approach to debugging is not only inefficient but also severely hinders rapid system iteration and the improvement of system reliability. There is an urgent need for an automated, systematic method to pinpoint the cause of failures, effectively bridging the gap between “evaluation results” and “system improvement.”

Core Contributions
This paper makes several groundbreaking contributions to address the challenges above:
1. Defining a New Problem: The paper is the first to formalize “automated failure attribution” as a concrete research task: identifying the agent responsible for a failure and the decisive error step that led to it.
2. Constructing the First Benchmark Dataset, Who&When: This dataset includes a wide range of failure logs collected from 127 LLM Multi-Agent systems, either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:
Who: The agent responsible for the failure.
When: The specific interaction step where the decisive error occurred.
Why: A natural language explanation of the cause of the failure.

3. Exploring Initial “Automated Attribution” Methods: Using the Who&When dataset, the paper designs and assesses three distinct methods for automated failure attribution:
– All-at-Once: This method provides the LLM with the user query and the complete failure log, asking it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it may struggle to pinpoint precise errors in long contexts.
– Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step but incurs higher costs and risks accumulating errors.
– Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which segment contains the error. It then recursively searches the identified segment, offering a balance of cost and performance.
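To make the third strategy concrete, here is a minimal Python sketch of binary-search attribution over a failure log. This is an illustrative reconstruction, not the paper’s released implementation; `ask_llm` is a hypothetical callable (prompt in, text out) standing in for an actual LLM API call.

```python
def binary_search_attribution(query: str, log: list[str], ask_llm) -> int:
    """Repeatedly halve the failure log until one step remains.

    `log` is the ordered list of interaction steps; `ask_llm` is a
    hypothetical callable (prompt -> str) standing in for an LLM call.
    Returns the index of the step judged to contain the decisive error.
    """
    lo, hi = 0, len(log)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        segment_a = "\n".join(f"[{i}] {s}" for i, s in enumerate(log[lo:mid], lo))
        segment_b = "\n".join(f"[{i}] {s}" for i, s in enumerate(log[mid:hi], mid))
        verdict = ask_llm(
            f"User task: {query}\n\n"
            f"Segment A:\n{segment_a}\n\nSegment B:\n{segment_b}\n\n"
            "Which segment contains the decisive error? Answer 'A' or 'B'."
        )
        if verdict.strip().upper().startswith("A"):
            hi = mid  # error is in the first half
        else:
            lo = mid  # error is in the second half
    return lo
```

Because each round discards half of the remaining log, the LLM makes only O(log n) judgments instead of one per step, which is the cost/performance trade-off the method is designed around.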


Experimental Results and Key Findings 
Experiments were conducted in two settings: one where the LLM knows the ground truth answer to the problem the Multi-Agent system is trying to solve (With Ground Truth) and one where it does not (Without Ground Truth). The primary model used was GPT-4o, though other models were also tested. The systematic evaluation of these methods on the Who&When dataset yielded several important insights:
A Long Way to Go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and a mere 14.2% in pinpointing the exact error step. Some methods performed even worse than random guessing, underscoring the difficulty of the task.
No “All-in-One” Solution: Different methods excel at different aspects of the problem. The All-at-Once method is better at identifying “Who,” while the Step-by-Step method is more effective at determining “When.” The Binary Search method provides a middle-ground performance.
 


Hybrid Approaches Show Promise but at a High Cost: The researchers found that combining different methods, such as using the All-at-Once approach to identify a potential agent and then applying the Step-by-Step method to find the error, can improve overall performance. However, this comes with a significant increase in computational cost.

State-of-the-Art Models Struggle: Surprisingly, even the most advanced reasoning models, such as OpenAI o1 and DeepSeek R1, find this task challenging. This highlights the inherent difficulty of automated failure attribution, which demands a higher level of reasoning than more conventional tasks require.
The Importance of Explicit Reasoning: Providing explicit prompts that require the LLM to explain its reasoning in the All-at-Once and Step-by-Step methods was shown to improve performance.


Context Length is a Limiting Factor: The study also revealed that as the context length of the failure logs increases, the performance of all attribution methods tends to decrease, with a more pronounced impact on the accuracy of identifying the error step.
Future Outlook: Paving the Way for More Reliable Multi-Agent Systems
“Automated failure attribution” is a crucial component in the development lifecycle of Multi-Agent systems. It has the potential to transform the challenge of identifying “what went wrong and who is to blame” from a perplexing mystery into a quantifiable and analyzable problem. By building a bridge between evaluation and improvement, we can ultimately create Multi-Agent systems that are more reliable, intelligent, and trustworthy.
 
 
 
 
 
 

About Synced

Machine Intelligence | Technology & Industry | Information & Analysis
