Quality Assurance with Agents
AI-generated code changes the code review and testing dynamic. Your policies need to adapt.
Code Review Policy Options
Should AI-generated code be reviewed differently? Most teams land in the middle.
| Policy | How it works | Best for |
|---|---|---|
| No differentiation | Same review for all code | Small, high-trust teams with rigorous reviews |
| Disclosure only | Authors flag significant AI code | Transparency without bureaucracy |
| Tiered review | Extra scrutiny on critical paths + AI | Risk-varying codebases |
| AI-assisted review | AI pre-reviews; humans focus on what AI misses | High PR volume teams |
If You Require Disclosure
Make it easy:
- Clear trigger: “Disclose when >20% of the PR was generated by AI tools”
- Simple mechanism: Checkbox in PR template or a tag/label (a CI check sketch follows this list)
- No stigma: Disclosure is information, not judgment
- Useful metadata: Which tool? What prompts? Helps with learning
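If you want the checkbox mechanism enforced rather than remembered, a small CI check can verify the disclosure section is still present in the PR description. A minimal sketch, assuming the PR body arrives via a `PR_BODY` environment variable and the checkbox wording matches your template (both are illustrative, not any particular CI provider’s API):

```python
# ci_disclosure_check.py: fail the build if the AI-disclosure checkbox was
# deleted from the PR description. The checkbox text and PR_BODY variable are
# assumptions; adapt them to your PR template and CI provider.
import os
import sys

# Accept both the unchecked and checked forms of the template checkbox.
DISCLOSURE_LINES = (
    "[ ] >20% of this PR was generated by AI tools",
    "[x] >20% of this PR was generated by AI tools",
)


def main() -> int:
    body = os.environ.get("PR_BODY", "").lower()
    if not any(line.lower() in body for line in DISCLOSURE_LINES):
        print("The AI-disclosure checkbox is missing from the PR description; "
              "please restore it from the template.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```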
Review Checklist for AI Code
Train reviewers to watch for AI-specific issues:
- Hallucinated APIs or methods
- Plausible but incorrect logic
- Missing edge case handling
- Inconsistent with existing patterns
- Over-engineered for the task
Plus standard checks: meets requirements, handles errors, security addressed, tests adequate, docs updated.
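To make the first two items concrete, a small Python illustration (the snippet and payload are invented for the example):

```python
from datetime import datetime, timezone

raw = {"created": 1_700_000_000}  # example payload

# Hallucinated API: reads plausibly, but `datetime.from_timestamp` does not
# exist. The real constructor is `datetime.fromtimestamp`.
# created = datetime.from_timestamp(raw["created"])   # AttributeError at runtime
created = datetime.fromtimestamp(raw["created"], tz=timezone.utc)


# Plausible but incorrect logic: a session with no "expires_at" key defaults
# to 0 and is silently treated as expired. Exactly the kind of quiet
# assumption a reviewer should question.
def is_expired(session: dict) -> bool:
    return session.get("expires_at", 0) < datetime.now(timezone.utc).timestamp()


print(created, is_expired({}))
```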
Building Reviewer Skills
Some reviewers catch AI mistakes better than others. This is trainable.
- Share examples of AI failures caught in review
- Pair junior reviewers with experienced ones
- Create a team knowledge base of AI pitfalls
What Not to Do
- Don’t create a separate “AI code” branch—integration nightmares
- Don’t require manager approval for AI usage—kills adoption
- Don’t ignore the conversation—pretending AI isn’t changing things helps no one
Shift Testing Left
Traditional: Write code → Test → Review → Deploy
Agentic: Write spec → Generate tests → Generate code → Verify tests pass → Review → Deploy
Tests come before implementation.
Why This Works
- Tests define success criteria for the agent
- Agents can run tests to self-validate
- Fewer iterations when the goal is clear
- Tests document intent
How to Implement
Write tests as part of task definition. Before asking an agent to implement a feature, write (or generate) the tests it should pass.
Use TDD prompting. “Here are the tests. Write code that makes them pass.”
Treat test failures as agent feedback. The test suite catches bugs, not you.
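A minimal sketch of what that looks like in practice, using pytest: the tests below are the task definition handed to the agent, and `myproject.text.slugify` is a hypothetical module the agent is asked to create.

```python
# test_slugify.py: written (or generated and reviewed) before any implementation.
# The prompt becomes: "Here are the tests. Write code that makes them pass."
import pytest

from myproject.text import slugify  # does not exist yet; the agent will add it


def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    assert slugify("Rock & Roll!") == "rock-roll"


def test_collapses_repeated_whitespace():
    assert slugify("a    b") == "a-b"


def test_rejects_blank_input():
    with pytest.raises(ValueError):
        slugify("   ")
```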
Using Agents for Test Generation
Agents excel at writing tests. Use this.
Unit tests: Give agent a function, get test cases back. Review for coverage gaps.
Edge cases: Agents are good at imagining cases you might miss.
Integration tests: More complex, but agents can generate scaffolding.
The Workflow
- Point agent at a module
- Request tests covering happy path, edges, and errors
- Review for gaps and hallucinated behavior (see the example after this list)
- Refine until coverage is meaningful
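A concrete illustration of the request-and-review steps: for a hypothetical `parse_price` function, a first pass from the agent might look like the tests below. The review is where you notice what’s missing (negative amounts, whitespace, thousands separators), so coverage isn’t yet meaningful.

```python
# Illustrative agent output for a hypothetical parse_price(text) -> Decimal.
from decimal import Decimal

import pytest

from myproject.pricing import parse_price  # hypothetical module under test


def test_happy_path():
    assert parse_price("19.99") == Decimal("19.99")


def test_edge_case_zero():
    assert parse_price("0") == Decimal("0")


def test_error_on_garbage():
    with pytest.raises(ValueError):
        parse_price("not a number")
```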
Watch For
- Tests that pass for wrong reasons (example below)
- Mocked dependencies hiding real issues
- Tests that don’t actually test the requirement
- Copy-paste tests that don’t add coverage
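The first two failure modes often look like the sketch below (invented for illustration): the first test only restates the mock’s configuration, so it passes no matter what the retry logic does. The second forces the failure path the function exists to handle.

```python
from unittest import mock


# A hypothetical function that should retry on transient failures.
def fetch_with_retry(client, url, attempts=3):
    for _ in range(attempts):
        try:
            return client.get(url)
        except ConnectionError:
            continue
    raise ConnectionError(url)


# Passes for the wrong reason: the mock never raises, so the retry loop is
# never exercised and the assertion echoes the mock's own configuration.
def test_fetch_with_retry():
    client = mock.Mock()
    client.get.return_value = {"ok": True}
    assert fetch_with_retry(client, "https://example.com") == {"ok": True}


# A more honest test: force the transient error, then verify the retry happened.
def test_fetch_retries_after_transient_error():
    client = mock.Mock()
    client.get.side_effect = [ConnectionError("boom"), {"ok": True}]
    assert fetch_with_retry(client, "https://example.com") == {"ok": True}
    assert client.get.call_count == 2
```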
Test Coverage as Guardrail
High test coverage makes agentic development safer. A minimal gate sketch follows the list.
- Minimum coverage gates: Don’t let agent-generated code reduce coverage
- Critical path requirements: Some paths need 100% coverage with meaningful tests
- Coverage trends: Track whether agent adoption correlates with coverage changes
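pytest-cov’s `--cov-fail-under` flag covers the simple case; the sketch below does the same with the coverage.py API so the threshold lives in one reviewed script. The 80% floor is an assumption to tune per repository, and it assumes the test run already produced a `.coverage` data file.

```python
# coverage_gate.py: fail CI if total coverage drops below the agreed floor.
import sys

from coverage import Coverage

MINIMUM_PERCENT = 80.0  # team-specific floor; critical paths may demand more


def main() -> int:
    cov = Coverage()
    cov.load()  # read the .coverage data file produced by the test run
    total = cov.report(file=sys.stdout)  # prints the table, returns total %
    if total < MINIMUM_PERCENT:
        print(f"Coverage {total:.1f}% is below the {MINIMUM_PERCENT:.0f}% gate.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```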
Testing Agent Output
Not just testing code—testing agent behavior itself.
Acceptance criteria: Define what code should do, shouldn’t do, edge cases, and verification method.
Canary testing (for automated workflows):
- Run agent changes through extended test suites before merge
- Stage behind feature flags (see the sketch below)
- Monitor for anomalies after deployment
Regression tracking: Notice patterns—do certain task types introduce more bugs? Are there problematic codepaths?
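For the feature-flag staging step, the pattern is to keep the agent’s rewrite dark until the canary period looks clean. A minimal sketch; the environment-variable flag lookup and the address functions are stand-ins for whatever flag service and code path you actually have:

```python
import os


def flag_enabled(name: str) -> bool:
    # Stand-in lookup; in practice this queries your feature-flag service.
    return os.environ.get(f"FLAG_{name.upper()}", "off") == "on"


def normalize_address(raw: str) -> str:
    # The agent-generated rewrite ships dark and is enabled per environment.
    if flag_enabled("agent_address_normalizer"):
        return _normalize_address_v2(raw)  # new, agent-generated path
    return _normalize_address_v1(raw)      # existing, trusted path


def _normalize_address_v1(raw: str) -> str:
    return " ".join(raw.split())


def _normalize_address_v2(raw: str) -> str:
    return " ".join(raw.split()).title()
```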
The Test Pyramid for Agents
- Unit tests (foundation): Fast, focused, run constantly. Agent-generated with human review.
- Integration tests (middle): Verify components work together. Human-guided generation.
- E2E tests (top): Verify full user flows. Fewer but critical. Often still human-written.
- Contract tests (boundaries): Verify API contracts. Especially important when agents modify interfaces.
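A contract test can be as small as asserting the shape of a payload rather than exact values. The sketch below assumes a hypothetical `get_user` handler and field set; the point is that an agent refactoring the handler cannot silently drop or retype a field consumers depend on.

```python
from myservice.api import get_user  # hypothetical handler under test

# Fields and types that downstream consumers rely on.
REQUIRED_FIELDS = {"id": int, "email": str, "created_at": str}


def test_user_payload_keeps_its_contract():
    payload = get_user(user_id=42)
    for field, expected_type in REQUIRED_FIELDS.items():
        assert field in payload, f"missing contract field: {field}"
        assert isinstance(payload[field], expected_type), (
            f"{field} changed type: expected {expected_type.__name__}, "
            f"got {type(payload[field]).__name__}"
        )
```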
Resources
Essential
- Your job is to deliver code you have proven to work - Standards for reviewing AI-assisted code
- Agent Readiness – Eno Reyes, Factory AI - How testing infrastructure affects agent reliability