Verifying Delivery Risk Agent Report Accuracy
Overview
Allstacks Deep Agent Reports leverage Large Language Models (LLMs) to analyze complex datasets and generate actionable insights. While these reports provide significant value, the underlying AI technology operates non-deterministically and can produce errors. This article explains why verification matters, how to perform accuracy checks, and what we're doing to improve reliability.
We appreciate your partnership in verifying report accuracy and helping us improve.
Why Verification Matters
The Nature of AI-Generated Analysis
Deep Agent Reports are generated by AI agents that process multiple data sources simultaneously. Unlike traditional software that produces identical outputs from identical inputs, LLMs operate probabilistically—meaning the same inputs can produce slightly different outputs across runs.
More importantly, these AI systems face a fundamental challenge: context window management. When an agent must analyze hundreds of work items, cross-reference contributor data, calculate statistics, and synthesize findings into a coherent narrative, it's processing an enormous amount of information. This can lead to:
Numerical drift – Statistics that don't quite match the source data
Fabricated specifics – The model "filling in" details that seem reasonable but aren't in the source
Conflation errors – Combining or confusing similar but distinct data points
Framing distortions – Technically accurate facts presented with misleading emphasis
These aren't signs of a broken system—they're inherent characteristics of current LLM technology that we must account for through verification.
Our Commitment to Accuracy
We take the accuracy of our reports seriously. While AI-generated analysis provides tremendous leverage for understanding complex delivery situations, we recognize that trust requires verification. We are actively working to:
Include automated verification checks as part of each Deep Agent run
Improve our prompting and data pipeline to reduce context window pressure
Provide transparency about what the AI can and cannot reliably determine
Give you the tools to verify findings that inform important decisions
Until automated verification is fully integrated, we recommend manual verification for reports that will drive significant decisions.
How to Verify a Deep Agent Report
Step 1: Gather the Source Files
Each Delivery Risk Report is generated from specific data exports. To verify a report, you'll need access to the same source files the AI analyzed. Typically these include:
| File Type | Description | Example Filename |
| --- | --- | --- |
| Per-Item Risk Details (CSV) | Individual work item data with risk scores, status, comments, and metadata | Allstacks-per-item-risk-details-[initiative].csv |
| Project Actors Summary (TXT) | Contributor hours, roles, and team assignments | Allstacks-project-actors-[initiative].txt |
| Generated Report (PDF) | The AI-generated report being verified | text-only-risk-report-[initiative].pdf |
Contact your Allstacks representative if you need access to the source files for a specific report.
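If you prefer to inspect the exports in a script before running the verification prompt, the per-item CSV loads cleanly with Python's standard library. A minimal sketch, using a tiny stand-in file in place of the real export (the column names follow fields mentioned elsewhere in this article; substitute your actual Allstacks-per-item-risk-details-[initiative].csv):

```python
import csv
from pathlib import Path

# Stand-in export for illustration only; replace with the real CSV file.
sample = Path("per-item-risk-details-sample.csv")
sample.write_text(
    "key,state,num_comments,delivery_risk_score\n"
    "TACO-101,Done,4,0.12\n"
    "TACO-102,In Progress,0,0.81\n",
    encoding="utf-8",
)

# Read every row into a dict keyed by column name, ready for recounts
# and per-ticket lookups in the later verification steps.
with sample.open(newline="", encoding="utf-8") as f:
    items = list(csv.DictReader(f))

print(f"Loaded {len(items)} items with columns {list(items[0].keys())}")
```

Once loaded this way, every check described below (status recounts, per-ticket field lookups) becomes a short loop over `items`.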
Step 2: Run the Verification Prompt
Use the following prompt with Claude (or another capable LLM) to systematically check the report against its source data. Attach all source files along with the generated report.
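If your LLM interface does not support file attachments, one workable alternative is to inline each source document into the prompt under a labeled header. A hedged sketch using the Version A prompt text; `build_request` is a hypothetical helper for illustration, not part of any Allstacks tooling:

```python
# Version A prompt from this article, reproduced verbatim.
VERIFICATION_PROMPT = (
    "find any claims in the text only risk report that contradict or seem "
    "dubious in light of the other attachments. rank findings in order of "
    "most clearly inaccurate to simply dubious or exaggerated claims"
)

def build_request(prompt: str, sources: dict[str, str]) -> str:
    """Label each source document, then append the verification task last."""
    parts = [f"=== SOURCE: {name} ===\n{text}" for name, text in sources.items()]
    parts.append(f"=== TASK ===\n{prompt}")
    return "\n\n".join(parts)

# Placeholder file contents; in practice read the real exports from disk.
request = build_request(
    VERIFICATION_PROMPT,
    {
        "per-item-risk-details.csv": "key,state\nTACO-101,Done\n",
        "risk-report.txt": "All items remain blocked.",
    },
)
print(request.splitlines()[0])
```

Keeping the task at the end of the assembled prompt mirrors the attachment workflow: the model sees all source material before the instruction.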
VERIFICATION PROMPT - Version A (Simple Output)
find any claims in the text only risk report that contradict or seem dubious in light of the other attachments. rank findings in order of most clearly inaccurate to simply dubious or exaggerated claims
VERIFICATION PROMPT - Version B (Detailed Output)
# Report Verification Instructions

You are verifying a generated risk report against its source data. Your job is to identify factual inaccuracies, unsupported claims, and misleading framing. Be skeptical and precise.

## Verification Categories

### 1. NUMERICAL ACCURACY (Highest Priority)

For every specific number in the report, trace it to source data:
- Status counts (Done, To Do, In Progress, etc.) - recount from raw data
- Hour figures for individuals - confirm each person's hours appear in source
- Percentages - recalculate, don't trust the report's math
- Counts of comments, PRs, approvals, bouncebacks - verify against item-level data

Flag as INACCURATE if: Number doesn't match source, or source doesn't contain that number at all.

### 2. DIRECT CLAIMS VS SOURCE DATA

For claims about specific tickets (e.g., "TACO-414 had zero review comments"):
- Look up that exact ticket in the source data
- Compare the claim word-for-word against what the source says
- Check fields like: num_comments, num_bouncebacks, state, resolution, risk scores

Flag as CONTRADICTED if: Source data directly says something different.

### 3. AGGREGATION AND CATEGORIZATION

When the report groups or categorizes items:
- Verify the grouping logic matches the source
- Check if categories were combined (e.g., "Code Review / QA" combining two distinct states)
- Confirm threshold definitions (e.g., "High Activity (80+ hours)") are applied consistently

Flag as MISCATEGORIZED if: Items placed in wrong buckets, or buckets artificially combined.

### 4. SOURCING GAPS

Identify claims where:
- The source data doesn't provide the information at all
- The report fills in blanks with invented specifics
- Multiple sources conflict and the report picks one without noting the conflict

Flag as UNSOURCED if: No source data supports the claim.
Flag as CONFLICTED if: Sources disagree and report doesn't acknowledge it.

### 5. EDITORIAL FRAMING (Lower Priority but Note)

Identify when technically accurate facts are framed misleadingly:
- Completed items cited as evidence of ongoing problems
- Low-risk-score items described in crisis language
- Administrative issues (stale epics) framed as delivery failures
- Isolated incidents generalized as "systematic patterns"

Flag as MISLEADING FRAMING if: Facts are accurate but presentation distorts their significance.

## Output Format

For each finding, provide:

[SEVERITY: INACCURATE | CONTRADICTED | MISCATEGORIZED | UNSOURCED | CONFLICTED | MISLEADING FRAMING]
Report Claim (Page X): "[exact quote from report]"
Source Data Shows: [what the actual data says, with specific field values]
Discrepancy: [precise description of the difference]

Rank findings by severity:
1. INACCURATE/CONTRADICTED - report states something factually wrong
2. UNSOURCED - report invents information not in sources
3. MISCATEGORIZED/CONFLICTED - structural or aggregation errors
4. MISLEADING FRAMING - accurate but spun

## Verification Checklist

Before concluding, confirm you've checked:
[ ] All status counts recalculated from raw item data
[ ] All individual hour figures traced to Project Actors or equivalent source
[ ] All "zero comments" or "zero review" claims verified against num_comments field
[ ] All percentage calculations redone
[ ] All ticket-specific claims cross-referenced to that ticket's record
[ ] Top contributors list verified against source hours data
[ ] Weekly velocity/hours tables spot-checked against source

## Important Principles

1. Numbers must trace to source - if you can't find where a number came from, flag it
2. Absence of evidence isn't evidence - "zero comments" requires seeing num_comments = 0
3. Completed ≠ Problem-free, but also ≠ Crisis - note when report catastrophizes resolved issues
4. Source conflicts are findings - if the source data is internally inconsistent, note it
5. Don't assume the report is right - verify, don't rationalize
Step 3: Interpret the Results
The verification output will categorize findings by severity:
| Severity | What It Means | Action Required |
| --- | --- | --- |
| INACCURATE / CONTRADICTED | Report states something factually wrong | Do not rely on this claim; use source data instead |
| UNSOURCED | Report includes information not present in source data | Treat as unverified; investigate independently if important |
| MISCATEGORIZED / CONFLICTED | Structural errors in how data was grouped or presented | Recalculate aggregates yourself if they matter to your decision |
| MISLEADING FRAMING | Facts are correct but emphasis distorts meaning | Consider the underlying facts separately from the narrative |
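For a long verification output, it can help to sort findings so the most serious issues surface first. A Python sketch assuming finding lines carry the `[SEVERITY: ...]` tag from the prompt's output format; the ranking mirrors the severity order above:

```python
import re

# Lower rank = more serious, following the prompt's severity ordering.
SEVERITY_RANK = {
    "INACCURATE": 0, "CONTRADICTED": 0,
    "UNSOURCED": 1,
    "MISCATEGORIZED": 2, "CONFLICTED": 2,
    "MISLEADING FRAMING": 3,
}

def triage(findings: list[str]) -> list[str]:
    """Sort raw finding lines so factually wrong claims come first."""
    def rank(line: str) -> int:
        m = re.search(r"\[SEVERITY:\s*([A-Z ]+?)\s*\]", line)
        return SEVERITY_RANK.get(m.group(1), 99) if m else 99
    return sorted(findings, key=rank)

# Hypothetical findings for illustration.
raw = [
    "[SEVERITY: MISLEADING FRAMING] Done items framed as ongoing risk",
    "[SEVERITY: INACCURATE] Status count 14 vs source count 11",
    "[SEVERITY: UNSOURCED] Hours figure not present in actors file",
]
for line in triage(raw):
    print(line)
```

Python's sort is stable, so findings within the same severity keep the order the verifier emitted them in.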
Common Error Patterns
Based on our analysis, these are the most frequent issues to watch for:
Status Count Errors: The report may miscount items by state (Done, To Do, In Progress). Always verify the status breakdown by counting items in the CSV yourself.
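The recount itself takes only a few lines. A sketch using stand-in CSV text in place of the real per-item export:

```python
import csv
import io
from collections import Counter

# Stand-in for the per-item export; in practice read the real CSV file.
csv_text = (
    "key,state\n"
    "TACO-101,Done\n"
    "TACO-102,Done\n"
    "TACO-103,In Progress\n"
    "TACO-104,To Do\n"
    "TACO-105,Code Review\n"
)

# Tally items per state straight from the raw rows, then compare each
# count against the status breakdown the report presents.
counts = Counter(row["state"] for row in csv.DictReader(io.StringIO(csv_text)))
print(dict(counts))
```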
Fabricated Hour Figures: When the Project Actors file doesn't specify hours for someone, the report may invent a plausible number. Check that every hour figure appears verbatim in the source.
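A blunt but effective check is a verbatim substring search: every hour figure the report cites should appear literally in the actors file. A sketch with hypothetical names and figures:

```python
# Hypothetical Project Actors text and reported figures, for illustration.
actors_text = (
    "Alice Chen: 124 hours (Backend)\n"
    "Bob Diaz: 31 hours (QA)\n"
)
reported_hours = ["124 hours", "31 hours", "88 hours"]  # figures cited in the report

# Any figure that never appears in the source text is a candidate
# fabrication and should be flagged as UNSOURCED.
missing = [h for h in reported_hours if h not in actors_text]
print(missing)
```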
"Zero Comments" Claims: Claims that PRs or tickets received "zero review comments" should be verified against the num_comments field in the CSV. This is a frequent error.
Combined Categories: The report may merge distinct statuses (e.g., "Code Review / QA: 8") that the source data tracks separately, inflating the combined category's count relative to the source breakdown.
Completed Items as Crisis Evidence: Items that are marked Done with low risk scores may still be cited as evidence of process problems. Check the state, resolution, and delivery_risk_score fields.
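One way to screen for this pattern is to flag any cited item that is already Done with a low risk score. A sketch with stand-in data; the 0-to-1 risk-score scale and the 0.2 threshold are illustrative assumptions, not Allstacks definitions:

```python
import csv
import io

# Stand-in export rows; the report hypothetically cites both tickets as risks.
csv_text = (
    "key,state,resolution,delivery_risk_score\n"
    "TACO-210,Done,Fixed,0.05\n"
    "TACO-211,In Progress,,0.92\n"
)
cited_as_risks = ["TACO-210", "TACO-211"]

rows = {r["key"]: r for r in csv.DictReader(io.StringIO(csv_text))}

# Completed, low-risk items cited as evidence of problems deserve a
# closer read: the facts may be accurate but the framing misleading.
suspect = [
    k for k in cited_as_risks
    if rows[k]["state"] == "Done" and float(rows[k]["delivery_risk_score"]) < 0.2
]
print(suspect)
```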
What We're Doing to Improve
Near-Term: Automated Verification
We are integrating verification checks directly into the Deep Agent pipeline. We are evaluating the inclusion of:
Automated status recounts validated against source data before report generation
Citation requirements forcing the model to reference specific source fields
Confidence indicators on claims that required inference vs. direct data lookup
Flagged gaps where source data was missing, and the model filled in details
Medium-Term: Context Window Optimization
The root cause of many errors is context window pressure—the AI trying to hold too much information at once. We're working on:
Chunked processing that analyzes data in smaller, more manageable pieces
Structured intermediate outputs that reduce the need for the model to "remember" raw data
Targeted retrieval that pulls specific data points on demand rather than loading everything upfront
Ongoing: Transparency and Feedback
Every Deep Agent Report includes a disclaimer about AI-generated content. We encourage you to:
Report inaccuracies you discover so we can improve our prompts and pipelines
Share verification findings with your Allstacks representative
Request source files for any report you need to verify
Questions?
If you have questions about verifying a specific report or want to discuss accuracy concerns, contact your Allstacks representative or reach out to support@allstacks.com.
Last updated: January 2026