Quality Assurance with Agents

AI-generated code changes the code review and testing dynamic. Your policies need to adapt.

Should AI-generated code be reviewed differently? Most teams land somewhere between “treat it like any other code” and “give every AI-assisted PR special scrutiny”:

  • No differentiation: the same review process for all code. Best for small, high-trust teams with rigorous reviews.
  • Disclosure only: authors flag significant AI contributions. Best for transparency without bureaucracy.
  • Tiered review: extra scrutiny on critical paths and AI-generated changes. Best for codebases where risk varies.
  • AI-assisted review: AI pre-reviews; humans focus on what the AI missed. Best for teams with high PR volume.

If you ask for disclosure, make it easy:

  • Clear trigger: “Disclose when >20% of the PR was generated by AI tools”
  • Simple mechanism: Checkbox in PR template or a tag/label
  • No stigma: Disclosure is information, not judgment
  • Useful metadata: Which tool? What prompts? Helps with learning
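If the checkbox lives in the PR template, a small CI step can make sure it doesn’t get deleted. The sketch below is a minimal example, not a standard tool: the checkbox wording and the PR_BODY environment variable are assumptions to adapt to your own template and CI system.

```python
# check_ai_disclosure.py -- hypothetical CI step that verifies the PR
# description still contains the AI-disclosure checkbox (checked or not).
# Assumes the CI job exposes the PR description in the PR_BODY env var.
import os
import re
import sys

# Matches a Markdown checkbox line mentioning "AI-generated", e.g.
# "- [x] More than ~20% of this PR was AI-generated".
DISCLOSURE_PATTERN = re.compile(r"- \[( |x)\] .*AI-generated", re.IGNORECASE)

def main() -> int:
    body = os.environ.get("PR_BODY", "")
    if not DISCLOSURE_PATTERN.search(body):
        print("Missing AI-disclosure checkbox: please keep it in the PR "
              "description, whether checked or unchecked.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Checking for presence rather than the answer preserves the “no stigma” principle: the gate enforces that the question gets asked, not how it is answered.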

Train reviewers to watch for AI-specific issues:

  • Hallucinated APIs or methods
  • Plausible but incorrect logic
  • Missing edge case handling
  • Inconsistent with existing patterns
  • Over-engineered for the task

Plus standard checks: meets requirements, handles errors, security addressed, tests adequate, docs updated.
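It also helps to show reviewers what these issues look like in practice. The snippet below is an invented example of “plausible but incorrect logic” and “missing edge case handling”; the function and its flaws are hypothetical, not taken from any real review.

```python
# Hypothetical agent-generated helper: it reads cleanly, but a trained
# reviewer should flag the two issues called out in the comments.
def chunk_records(records, batch_size):
    """Split records into batches of at most batch_size items."""
    batches = []
    # Plausible but incorrect: batch_size is never validated, so a caller
    # passing 0 gets an obscure ValueError from range() instead of a clear
    # error message at the boundary.
    for start in range(0, len(records), batch_size):
        batches.append(records[start:start + batch_size])
    # Missing edge case: an empty `records` silently returns []; whether
    # that matches the requirement was never specified or tested.
    return batches
```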

Some reviewers catch AI mistakes better than others. This is trainable.

  • Share examples of AI failures caught in review
  • Pair junior reviewers with experienced ones
  • Create a team knowledge base of AI pitfalls

Avoid these anti-patterns:

  • Don’t create a separate “AI code” branch: it creates integration nightmares
  • Don’t require manager approval for AI usage: it kills adoption
  • Don’t ignore the conversation: pretending AI isn’t changing things helps no one

Traditional: Write code → Test → Review → Deploy

Agentic: Write spec → Generate tests → Generate code → Verify tests pass → Review → Deploy

Tests come before implementation.

  • Tests define success criteria for the agent
  • Agents can run tests to self-validate
  • Fewer iterations when the goal is clear
  • Tests document intent

Write tests as part of the task definition. Before asking an agent to implement a feature, write (or generate) the tests it should pass.

Use TDD prompting. “Here are the tests. Write code that makes them pass.”
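A minimal sketch of what that hand-off can look like, using pytest and an invented `slugify` task; the module name and the exact behaviors are assumptions, not a required format.

```python
# test_slugify.py -- written before any implementation exists.
# These tests are the task definition handed to the agent:
# "Here are the tests. Write slugify() in text_utils.py so they pass."
import pytest

from text_utils import slugify  # hypothetical module the agent will create

def test_basic_lowercasing_and_hyphens():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Rock & Roll!") == "rock-roll"

def test_collapses_repeated_separators():
    assert slugify("a  --  b") == "a-b"

@pytest.mark.parametrize("bad_input", ["", "   ", "!!!"])
def test_rejects_inputs_with_no_usable_characters(bad_input):
    with pytest.raises(ValueError):
        slugify(bad_input)
```

The tests double as acceptance criteria: when they pass, the task is done; when they fail, the failure output becomes the next prompt.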

Treat test failures as agent feedback. The test suite catches bugs, not you.

Agents excel at writing tests. Use this.

Unit tests: Give agent a function, get test cases back. Review for coverage gaps.

Edge cases: Agents are good at imagining cases you might miss.

Integration tests: More complex, but agents can generate scaffolding.

  1. Point agent at a module
  2. Request tests covering happy path, edges, and errors
  3. Review for gaps and hallucinated behavior
  4. Refine until coverage is meaningful

When reviewing generated tests, watch for:

  • Tests that pass for wrong reasons
  • Mocked dependencies hiding real issues
  • Tests that don’t actually test the requirement
  • Copy-paste tests that don’t add coverage
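An invented example of the first and third pitfalls, with the tighter version a reviewer might ask for; the discount rule and numbers are made up for illustration.

```python
# Implementation under review (hypothetical): "orders of 100 or more
# get a 10% discount".
def apply_discount(total):
    return total * 0.9 if total >= 100 else total

# Weak agent-generated test: passes, but only exercises one comfortable
# point, so it would still pass if the >= boundary were mistyped as >.
def test_discount_weak():
    assert apply_discount(200) == 180

# Better: pins the boundary and the no-discount case the requirement implies.
def test_discount_boundary():
    assert apply_discount(100) == 90
    assert apply_discount(99.99) == 99.99
```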

High test coverage makes agentic development safer.

  • Minimum coverage gates: Don’t let agent-generated code reduce coverage
  • Critical path requirements: Some paths need 100% coverage with meaningful tests
  • Coverage trends: Track whether agent adoption correlates with coverage changes
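Coverage gates are usually a one-line CI setting, e.g. coverage.py’s fail_under option or pytest-cov’s --cov-fail-under flag. If you want the gate as an explicit, auditable step, a sketch like the one below also works; it assumes `coverage json` has already produced a coverage.json report with a totals.percent_covered field, and the threshold is an example.

```python
# coverage_gate.py -- hypothetical CI step enforcing a minimum coverage gate.
import json
import sys

MINIMUM_PERCENT = 80.0  # example threshold; tune per repo or per path

def main() -> int:
    with open("coverage.json") as fh:
        report = json.load(fh)
    percent = report["totals"]["percent_covered"]
    if percent < MINIMUM_PERCENT:
        print(f"Coverage {percent:.1f}% is below the {MINIMUM_PERCENT:.1f}% gate.")
        return 1
    print(f"Coverage {percent:.1f}% meets the gate.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```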

Verification isn’t just about testing the code; it also means testing agent behavior itself.

Acceptance criteria: Define what the code should do, what it shouldn’t do, which edge cases matter, and how you’ll verify it.

Canary testing (for automated workflows):

  • Run agent changes through extended test suites before merge
  • Stage behind feature flags
  • Monitor for anomalies after deployment
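A minimal sketch of the feature-flag idea, with invented function names and a hard-coded rollout fraction standing in for whatever flagging system you already use.

```python
import logging
import random

CANARY_FRACTION = 0.05  # hypothetical rollout knob; normally read from config

def price_quote_v1(order_total):
    # Existing, trusted implementation (placeholder).
    return round(order_total * 1.08, 2)

def price_quote_v2(order_total):
    # Agent-generated rewrite being canaried (placeholder).
    return round(order_total * 1.08, 2)

def price_quote(order_total):
    """Route a small fraction of calls to the new path; fall back on error."""
    if random.random() < CANARY_FRACTION:
        try:
            return price_quote_v2(order_total)
        except Exception:
            logging.exception("canary price_quote_v2 failed; falling back")
    return price_quote_v1(order_total)
```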

Regression tracking: Notice patterns—do certain task types introduce more bugs? Are there problematic codepaths?

The test pyramid still applies, with agent-specific twists:

  • Unit tests (foundation): Fast, focused, run constantly. Agent-generated with human review.
  • Integration tests (middle): Verify components work together. Human-guided generation.
  • E2E tests (top): Verify full user flows. Fewer but critical. Often still human-written.
  • Contract tests (boundaries): Verify API contracts. Especially important when agents modify interfaces.
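A small example of what a contract test can look like in the pytest style, with an invented payload shape; the point is that an agent refactoring the handler cannot silently rename or drop fields without a failing test.

```python
# Hypothetical contract test for a user endpoint's response shape.
EXPECTED_FIELDS = {
    "id": int,
    "email": str,
    "created_at": str,  # ISO-8601 string per the documented contract
}

def get_user_payload(user_id):
    # Placeholder for the real handler or client call under test.
    return {
        "id": user_id,
        "email": "a@example.com",
        "created_at": "2024-01-01T00:00:00Z",
    }

def test_user_payload_matches_contract():
    payload = get_user_payload(1)
    assert set(payload) == set(EXPECTED_FIELDS), "contract fields changed"
    for field, expected_type in EXPECTED_FIELDS.items():
        assert isinstance(payload[field], expected_type), field
```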