
Researchers from PSU and Duke introduce “Multi-Agent Systems Automated Failure Attribution”

"Automated failure attribution" is a crucial component in the development lifecycle of Multi-Agent systems. It has the potential to transform the challenge of identifying "what went wrong and who is to blame" from a perplexing mystery into a quantifiable and analyzable problem

Share My Research is Synced’s column that welcomes scholars to share their own research breakthroughs with over 2M global AI enthusiasts. Beyond technological advances, Share My Research also calls for interesting stories behind the research and exciting research ideas. 

Meet the author
Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang of Penn State University and Ming Yin of Duke University.

In recent years, LLM Multi-Agent systems have garnered widespread attention for their collaborative approach to solving complex problems. However, these systems often fail at a task despite a flurry of activity. This leaves developers with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to pinpoint the root cause is like finding a needle in a haystack, a time-consuming and labor-intensive effort.
 
This is a familiar frustration for developers. In increasingly complex Multi-Agent systems, failures are not only common but also incredibly difficult to diagnose due to the autonomous nature of agent collaboration and long information chains. Without a way to quickly identify the source of a failure, system iteration and optimization grind to a halt.
 
To address this challenge, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the novel research problem of “Automated Failure Attribution.” They have constructed the first benchmark dataset for this task, Who&When, and have developed and evaluated several automated attribution methods. This work not only highlights the complexity of the task but also paves a new path toward enhancing the reliability of LLM Multi-Agent systems.
The paper has been accepted as a Spotlight presentation at the top-tier machine learning conference, ICML 2025, and the code and dataset are now fully open-source.

Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When
 
 
Research Background and Challenges
LLM-driven Multi-Agent systems have demonstrated immense potential across many domains. However, these systems are fragile; errors by a single agent, misunderstandings between agents, or mistakes in information transmission can lead to the failure of the entire task.

Currently, when a system fails, developers are often left with manual and inefficient methods for debugging:
  • Manual Log Archaeology: Developers must manually review lengthy interaction logs to find the source of the problem.
  • Reliance on Expertise: The debugging process is highly dependent on the developer’s deep understanding of the system and the task at hand.
 
This “needle in a haystack” approach to debugging is not only inefficient but also severely hinders rapid system iteration and the improvement of system reliability. There is an urgent need for an automated, systematic method to pinpoint the cause of failures, effectively bridging the gap between “evaluation results” and “system improvement.”



Core Contributions
This paper makes three contributions to address the challenges above:
1. Defining a New Problem: The paper is the first to formalize “automated failure attribution” as a specific research task. This task is defined by identifying the failure-responsible agent and the decisive error step that led to the task’s failure.

2. Constructing the First Benchmark Dataset, Who&When: This dataset includes a wide range of failure logs collected from 127 LLM Multi-Agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:
  • Who: The agent responsible for the failure.
  • When: The specific interaction step where the decisive error occurred.
  • Why: A natural language explanation of the cause of the failure.

3. Exploring Initial “Automated Attribution” Methods: Using the Who&When dataset, the paper designs and assesses three distinct methods for automated failure attribution:
  • All-at-Once: This method provides the LLM with the user query and the complete failure log, asking it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it may struggle to pinpoint precise errors in long contexts.
  • Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step but incurs higher costs and risks accumulating errors.
  • Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which segment contains the error. It then recursively searches the identified segment, offering a balance of cost and performance.
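
The control flow of the three strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` layout, the `attribute` and `contains_error` callables (stand-ins for LLM calls), and the prefix-based formulation of Binary Search are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Assumed log layout: each step is (agent_name, message).
Step = Tuple[str, str]
Attribution = Tuple[str, int]  # (responsible agent, 0-based step index)

# Hypothetical stand-ins for LLM calls; neither comes from the paper's code.
# OneShotJudge reads the whole log and names an agent and a step.
OneShotJudge = Callable[[str, List[Step]], Attribution]
# PrefixJudge answers: "has the decisive error already occurred in this prefix?"
PrefixJudge = Callable[[str, List[Step]], bool]


def all_at_once(query: str, log: List[Step], attribute: OneShotJudge) -> Attribution:
    """One call over the full log: cheapest, but the judge sees a long context."""
    return attribute(query, log)


def step_by_step(
    query: str, log: List[Step], contains_error: PrefixJudge
) -> Optional[Attribution]:
    """Grow the prefix one step at a time; stop at the first flagged step."""
    for i in range(len(log)):
        if contains_error(query, log[: i + 1]):
            return log[i][0], i
    return None  # the judge never flagged a step


def binary_search(
    query: str, log: List[Step], contains_error: PrefixJudge
) -> Optional[Attribution]:
    """Halve the candidate range, keeping the half the judge points to.

    Assumes the log is a failure log, so some step is the decisive error.
    """
    if not log:
        return None
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if contains_error(query, log[: mid + 1]):
            hi = mid  # error is at or before mid
        else:
            lo = mid + 1  # error is after mid
    return log[lo][0], lo
```

Under these assumptions, a log of n steps costs one judge call for All-at-Once, up to n calls for Step-by-Step, and roughly log2(n) calls for Binary Search, which matches the cost ordering described above.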
 
Experimental Results and Key Findings

Experiments were conducted in two settings: one where the LLM knows the ground truth answer to the problem the Multi-Agent system is trying to solve (With Ground Truth) and one where it does not (Without Ground Truth). The primary model used was GPT-4o, though other models were also tested. The systematic evaluation of these methods on the Who&When dataset yielded several important insights:

  • A Long Way to Go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and a mere 14.2% in pinpointing the exact error step. Some methods performed even worse than random guessing, underscoring the difficulty of the task.
  • No “All-in-One” Solution: Different methods excel at different aspects of the problem. The All-at-Once method is better at identifying “Who,” while the Step-by-Step method is more effective at determining “When.” The Binary Search method falls between the two.
  • Hybrid Approaches Show Promise but at a High Cost: The researchers found that combining different methods, such as using the All-at-Once approach to identify a potential agent and then applying the Step-by-Step method to find the error, can improve overall performance. However, this comes with a significant increase in computational cost.
  • State-of-the-Art Models Struggle: Surprisingly, even the most advanced reasoning models, like OpenAI o1 and DeepSeek R1, find this task challenging. This highlights the inherent difficulty of automated failure attribution, which demands a higher level of reasoning than what is required for more conventional tasks.
  • The Importance of Explicit Reasoning: Prompting the LLM to explain its reasoning explicitly in the All-at-Once and Step-by-Step methods was shown to improve performance.
  • Context Length is a Limiting Factor: The study also revealed that as the context length of the failure logs increases, the performance of all attribution methods tends to decrease, with a more pronounced impact on the accuracy of identifying the error step.
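
The two accuracy figures quoted above can be computed along the following lines. This is a hedged sketch: exact-match scoring, the independent scoring of agent and step, and the `(agent, step)` tuple layout are assumptions for illustration and may differ from the paper's evaluation code.

```python
from typing import List, Optional, Tuple

Attribution = Tuple[str, int]  # (agent name, 0-based step index); assumed layout


def attribution_accuracy(
    predictions: List[Optional[Attribution]],
    labels: List[Attribution],
) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) by exact match.

    A prediction of None (the method gave no answer) counts as a miss
    on both levels.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled failure log is required")

    agent_hits = 0
    step_hits = 0
    for pred, (true_agent, true_step) in zip(predictions, labels):
        if pred is None:
            continue
        pred_agent, pred_step = pred
        agent_hits += pred_agent == true_agent  # "Who" is correct
        step_hits += pred_step == true_step     # "When" is correct

    n = len(labels)
    return agent_hits / n, step_hits / n
```

Scoring the two levels separately makes it possible for a method to rank well on “Who” and poorly on “When,” which is the pattern reported in the findings.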

