Verifying Delivery Risk Agent Report Accuracy

Overview

Allstacks Deep Agent Reports leverage Large Language Models (LLMs) to analyze complex datasets and generate actionable insights. While these reports provide significant value, the underlying AI technology operates non-deterministically and can produce errors. This article explains why verification matters, how to perform accuracy checks, and what we're doing to improve reliability.

Thank you for your partnership in this effort.

Why Verification Matters

The Nature of AI-Generated Analysis

Deep Agent Reports are generated by AI agents that process multiple data sources simultaneously. Unlike traditional software that produces identical outputs from identical inputs, LLMs operate probabilistically—meaning the same inputs can produce slightly different outputs across runs.

More importantly, these AI systems face a fundamental challenge: context window management. When an agent must analyze hundreds of work items, cross-reference contributor data, calculate statistics, and synthesize findings into a coherent narrative, it's processing an enormous amount of information. This can lead to:

  • Numerical drift – Statistics that don't quite match the source data

  • Fabricated specifics – The model "filling in" details that seem reasonable but aren't in the source

  • Conflation errors – Combining or confusing similar but distinct data points

  • Framing distortions – Technically accurate facts presented with misleading emphasis

These aren't signs of a broken system—they're inherent characteristics of current LLM technology that we must account for through verification.

Our Commitment to Accuracy

We take the accuracy of our reports seriously. While AI-generated analysis provides tremendous leverage for understanding complex delivery situations, we recognize that trust requires verification. We are actively working to:

  1. Include automated verification checks as part of each Deep Agent run

  2. Improve our prompting and data pipeline to reduce context window pressure

  3. Provide transparency about what the AI can and cannot reliably determine

  4. Give you the tools to verify findings that inform important decisions

Until automated verification is fully integrated, we recommend manual verification for reports that will drive significant decisions.

How to Verify a Deep Agent Report

Step 1: Gather the Source Files

Each Delivery Risk Report is generated from specific data exports. To verify a report, you'll need access to the same source files the AI analyzed. Typically these include:

  • Per-Item Risk Details (CSV): Individual work item data with risk scores, status, comments, and metadata. Example filename: Allstacks-per-item-risk-details-[initiative].csv

  • Project Actors Summary (TXT): Contributor hours, roles, and team assignments. Example filename: Allstacks-project-actors-[initiative].txt

  • Generated Report (PDF): The AI-generated report being verified. Example filename: text-only-risk-report-[initiative].pdf

Contact your Allstacks representative if you need access to the source files for a specific report.
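
If you want to inspect the data yourself before (or instead of) running an LLM check, you can load the exports with a few lines of Python. This is a minimal sketch assuming the pandas library and the example filenames above; column names such as state are assumptions based on the fields referenced later in this article, so adjust them to match your export.

import pandas as pd

# Load the per-item risk details export (example filename from the list above).
items = pd.read_csv("Allstacks-per-item-risk-details-myinitiative.csv")

# The Project Actors summary is plain text; read it for manual cross-checks.
with open("Allstacks-project-actors-myinitiative.txt") as f:
    actors_text = f.read()

# Orient yourself: which columns exist, and what does the status breakdown
# look like? "state" is an assumed column name -- rename to match your export.
print(items.columns.tolist())
print(items["state"].value_counts())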

Step 2: Run the Verification Prompt

Use one of the following prompts with Claude (or another capable LLM) to systematically check the report against its source data, attaching all of the source files along with the generated report. Version A is quick and returns a free-form ranked list; Version B enforces structured, categorized findings.

VERIFICATION PROMPT - Version A (Simple Output)

find any claims in the text only risk report that contradict or seem dubious in light of the other attachments. rank findings in order of most clearly inaccurate to simply dubious or exaggerated claims

 

VERIFICATION PROMPT - Version B (Detailed Output)

# Report Verification Instructions

You are verifying a generated risk report against its source data. Your job is to identify factual inaccuracies, unsupported claims, and misleading framing. Be skeptical and precise.

## Verification Categories

### 1. NUMERICAL ACCURACY (Highest Priority)
For every specific number in the report, trace it to source data:
- Status counts (Done, To Do, In Progress, etc.) - recount from raw data
- Hour figures for individuals - confirm each person's hours appear in source
- Percentages - recalculate, don't trust the report's math
- Counts of comments, PRs, approvals, bouncebacks - verify against item-level data

Flag as INACCURATE if: Number doesn't match source, or source doesn't contain that number at all.

### 2. DIRECT CLAIMS VS SOURCE DATA
For claims about specific tickets (e.g., "TACO-414 had zero review comments"):
- Look up that exact ticket in the source data
- Compare the claim word-for-word against what the source says
- Check fields like: num_comments, num_bouncebacks, state, resolution, risk scores

Flag as CONTRADICTED if: Source data directly says something different.

### 3. AGGREGATION AND CATEGORIZATION
When the report groups or categorizes items:
- Verify the grouping logic matches the source
- Check if categories were combined (e.g., "Code Review / QA" combining two distinct states)
- Confirm threshold definitions (e.g., "High Activity (80+ hours)") are applied consistently

Flag as MISCATEGORIZED if: Items placed in wrong buckets, or buckets artificially combined.

### 4. SOURCING GAPS
Identify claims where:
- The source data doesn't provide the information at all
- The report fills in blanks with invented specifics
- Multiple sources conflict and the report picks one without noting the conflict

Flag as UNSOURCED if: No source data supports the claim.
Flag as CONFLICTED if: Sources disagree and report doesn't acknowledge it.

### 5. EDITORIAL FRAMING (Lower Priority but Note)
Identify when technically accurate facts are framed misleadingly:
- Completed items cited as evidence of ongoing problems
- Low-risk-score items described in crisis language
- Administrative issues (stale epics) framed as delivery failures
- Isolated incidents generalized as "systematic patterns"

Flag as MISLEADING FRAMING if: Facts are accurate but presentation distorts their significance.

## Output Format

For each finding, provide:

[SEVERITY: INACCURATE | CONTRADICTED | MISCATEGORIZED | UNSOURCED | CONFLICTED | MISLEADING FRAMING]

Report Claim (Page X): "[exact quote from report]"

Source Data Shows: [what the actual data says, with specific field values]

Discrepancy: [precise description of the difference]

Rank findings by severity:
1. INACCURATE/CONTRADICTED - report states something factually wrong
2. UNSOURCED - report invents information not in sources
3. MISCATEGORIZED/CONFLICTED - structural or aggregation errors
4. MISLEADING FRAMING - accurate but spun

## Verification Checklist

Before concluding, confirm you've checked:
[ ] All status counts recalculated from raw item data
[ ] All individual hour figures traced to Project Actors or equivalent source
[ ] All "zero comments" or "zero review" claims verified against num_comments field
[ ] All percentage calculations redone
[ ] All ticket-specific claims cross-referenced to that ticket's record
[ ] Top contributors list verified against source hours data
[ ] Weekly velocity/hours tables spot-checked against source

## Important Principles

1. Numbers must trace to source - if you can't find where a number came from, flag it
2. Absence of evidence isn't evidence - "zero comments" requires seeing num_comments = 0
3. Completed ≠ Problem-free, but also ≠ Crisis - note when report catastrophizes resolved issues
4. Source conflicts are findings - if the source data is internally inconsistent, note it
5. Don't assume the report is right - verify, don't rationalize
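
If you prefer to run the check programmatically rather than pasting files into a chat interface, the sketch below shows one way to do it with the anthropic Python SDK. It is a minimal illustration, not an officially supported workflow: the model name is only an example, the filenames follow the patterns above, and the PDF report is assumed to have been exported or extracted to plain text first so its contents can be embedded in the message.

import anthropic

VERIFICATION_PROMPT = "..."  # paste Version A or Version B from above

def read(path: str) -> str:
    with open(path) as f:
        return f.read()

# The SDK reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model name; use a current one
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": (
            "SOURCE FILE: per-item risk details (CSV)\n"
            + read("Allstacks-per-item-risk-details-myinitiative.csv")
            + "\n\nSOURCE FILE: project actors summary (TXT)\n"
            + read("Allstacks-project-actors-myinitiative.txt")
            + "\n\nREPORT UNDER VERIFICATION (extracted from the PDF)\n"
            + read("text-only-risk-report-myinitiative.txt")
            + "\n\n" + VERIFICATION_PROMPT
        ),
    }],
)

print(response.content[0].text)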

 

Step 3: Interpret the Results

The verification output will categorize findings by severity:

  • INACCURATE / CONTRADICTED: The report states something factually wrong. Action: do not rely on this claim; use the source data instead.

  • UNSOURCED: The report includes information not present in the source data. Action: treat it as unverified; investigate independently if it matters to your decision.

  • MISCATEGORIZED / CONFLICTED: Structural errors in how data was grouped or presented. Action: recalculate the affected aggregates yourself if they matter to your decision.

  • MISLEADING FRAMING: The facts are correct, but the emphasis distorts their meaning. Action: consider the underlying facts separately from the narrative.
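
If you use Version B, the [SEVERITY: ...] tags it requests make the output easy to triage mechanically. A small sketch, assuming the output followed the format defined in the prompt:

import re
from collections import Counter

SEVERITY_ORDER = [
    "INACCURATE", "CONTRADICTED", "UNSOURCED",
    "MISCATEGORIZED", "CONFLICTED", "MISLEADING FRAMING",
]

def triage(verification_output: str) -> None:
    # Pull every [SEVERITY: ...] tag out of the verification output
    # and tally findings, most serious categories first.
    tags = re.findall(r"\[SEVERITY:\s*([A-Z ]+?)\s*\]", verification_output)
    counts = Counter(tag.strip() for tag in tags)
    for severity in SEVERITY_ORDER:
        print(f"{severity}: {counts.get(severity, 0)} finding(s)")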

Common Error Patterns

Based on our analysis, these are the most frequent issues to watch for (a spot-check sketch follows the list):

  • Status Count Errors: The report may miscount items by state (Done, To Do, In Progress). Always verify the status breakdown by counting items in the CSV yourself.

  • Fabricated Hour Figures: When the Project Actors file doesn't specify hours for someone, the report may invent a plausible number. Check that every hour figure appears verbatim in the source.

  • "Zero Comments" Claims: Claims that PRs or tickets received "zero review comments" should be verified against the num_comments field in the CSV. This is a frequent error.

  • Combined Categories: The report may combine distinct statuses (e.g., "Code Review / QA: 8") when the source data has them separate. This inflates some categories.

  • Completed Items as Crisis Evidence: Items that are marked Done with low risk scores may still be cited as evidence of process problems. Check the state, resolution, and delivery_risk_score fields.
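
Several of these patterns can be spot-checked directly against the CSV. A minimal sketch using pandas; the column names (state, num_comments, delivery_risk_score) are the fields this article references, and the risk-score threshold is arbitrary, so adjust both to match your export and your own definitions:

import pandas as pd

items = pd.read_csv("Allstacks-per-item-risk-details-myinitiative.csv")

# 1. Status counts: compare this breakdown against the report's numbers,
#    one state at a time (watch for distinct states being merged).
print(items["state"].value_counts())

# 2. "Zero comments" claims: these are the items the data actually shows
#    with no comments; check the report's claims against this set.
#    Add your export's item-key column to the column list for readability.
print(items.loc[items["num_comments"] == 0, ["state", "num_comments"]])

# 3. Completed items cited as crisis evidence: Done items with low risk
#    scores should not be driving an "ongoing problem" narrative.
#    The threshold of 25 is arbitrary, for illustration only.
done_low_risk = items[(items["state"] == "Done") & (items["delivery_risk_score"] < 25)]
print(f"{len(done_low_risk)} completed items with low delivery risk scores")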

What We're Doing to Improve

Near-Term: Automated Verification

We are integrating verification checks directly into the Deep Agent pipeline. Candidates under evaluation include the following (a hypothetical sketch of one such check appears after the list):

  • Automated status recounts validated against source data before report generation

  • Citation requirements forcing the model to reference specific source fields

  • Confidence indicators on claims that required inference vs. direct data lookup

  • Flagged gaps where source data was missing, and the model filled in details
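
To make the first of these concrete, here is a hypothetical sketch of what a pre-publication recount check could look like. It illustrates the idea only and is not the shipped Allstacks pipeline; the "state" column name is an assumption carried over from earlier examples.

import pandas as pd

def validate_status_counts(csv_path: str, claimed: dict[str, int]) -> list[str]:
    """Compare status counts claimed by a draft report to a recount of the raw items."""
    actual = pd.read_csv(csv_path)["state"].value_counts().to_dict()
    problems = []
    for state, claimed_count in claimed.items():
        actual_count = actual.get(state, 0)
        if actual_count != claimed_count:
            problems.append(
                f"{state}: report claims {claimed_count}, source shows {actual_count}"
            )
    return problems

# Example: hold the report back if any claimed count fails the recount.
# validate_status_counts("items.csv", {"Done": 42, "In Progress": 7})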

Medium-Term: Context Window Optimization

The root cause of many errors is context window pressure—the AI trying to hold too much information at once. We're working on the following (a schematic sketch appears after the list):

  • Chunked processing that analyzes data in smaller, more manageable pieces

  • Structured intermediate outputs that reduce the need for the model to "remember" raw data

  • Targeted retrieval that pulls specific data points on demand rather than loading everything upfront
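
To illustrate what chunked processing with structured intermediate outputs means in practice, a schematic sketch follows. The chunk size and summary fields are illustrative assumptions, not the production design:

import pandas as pd

CHUNK_SIZE = 50  # illustrative; sized to stay well inside the model's context window

def summarize(chunk: pd.DataFrame) -> dict:
    # A compact, structured intermediate output the model can cite later,
    # instead of having to "remember" the raw rows.
    return {
        "rows": len(chunk),
        "status_counts": chunk["state"].value_counts().to_dict(),
        "total_comments": int(chunk["num_comments"].sum()),
    }

def chunked_summaries(csv_path: str) -> list[dict]:
    # pandas reads the CSV in fixed-size batches rather than all at once.
    return [summarize(chunk) for chunk in pd.read_csv(csv_path, chunksize=CHUNK_SIZE)]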

Ongoing: Transparency and Feedback

Every Deep Agent Report includes a disclaimer about AI-generated content. We encourage you to:

  • Report inaccuracies you discover so we can improve our prompts and pipelines

  • Share verification findings with your Allstacks representative

  • Request source files for any report you need to verify

Questions?

If you have questions about verifying a specific report or want to discuss accuracy concerns, contact your Allstacks representative or reach out to support@allstacks.com.

Last updated: January 2026