Review test results
EARLY ACCESS
Review test results to identify issues and verify agent behavior.
View results for a test execution
- Go to your test group.
- Go to the Past runs tab.
- Select the date of the test execution you want to review.
The results page shows:
- Overall performance - Total test cases, number of passes (agent called the required tools and gave correct answers), and number of fails (agent did not call the required tools or gave incorrect answers).
- Test case details - For each test case: user input, overall result (Pass/Fail), responses matched (Pass/Fail), and tools matched (Pass/Fail).
View the details of a test case
A test case passes when the agent's responses closely match the expected responses; an exact match is not required. A test case fails if even one agent message, tool call, or subagent call does not closely match what you configured.
- Select the test case that you want to view.
- Review the aggregated summary:
- Result: Overall pass or fail status.
- Responses matched: The number of agent responses that matched the expected responses, for example: 2/2.
- Tools matched: The number of tool and subagent calls that matched expectations, for example: 4/4.
- Compare expected and actual behavior:
- Expected behavior (left column): The test case that you configured.
- Actual behavior (right column): How the agent actually responded.
- Identify the failure point by reviewing each agent message, tool call, and subagent call. Look for Pass or Fail indicators next to each element. The overall result fails if any of the following occur:
- Agent message does not match the expected response.
- Tool call does not match the expected tool.
- Subagent call does not match the expected subagent.
- Use the insights to update your agent or test cases.
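The pass/fail logic above can be sketched as a small aggregation: the overall result fails if any single element fails, while the "Responses matched" and "Tools matched" counts summarize each category. This is an illustrative sketch, not the product's implementation; the `ElementResult` shape and `evaluate_test_case` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ElementResult:
    kind: str      # "message", "tool_call", or "subagent_call"
    passed: bool   # whether this element closely matched the expected behavior

def evaluate_test_case(elements: list[ElementResult]) -> dict:
    """Aggregate per-element results: the test case passes only if every
    agent message, tool call, and subagent call matched expectations."""
    responses = [e for e in elements if e.kind == "message"]
    tools = [e for e in elements if e.kind in ("tool_call", "subagent_call")]
    return {
        "result": "Pass" if all(e.passed for e in elements) else "Fail",
        "responses_matched": f"{sum(e.passed for e in responses)}/{len(responses)}",
        "tools_matched": f"{sum(e.passed for e in tools)}/{len(tools)}",
    }

case = [
    ElementResult("message", True),
    ElementResult("tool_call", True),
    ElementResult("tool_call", False),  # one mismatched tool call fails the case
]
print(evaluate_test_case(case))
```

A single failing tool call is enough to mark the whole test case as Fail, even when every agent message matched.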
Compare results across multiple runs
If the test execution has multiple runs, you can view the overall results in addition to the details for each run.
| Column | Description |
|---|---|
| Test cases | Total number of test cases in the test group. |
| Pass rate | Average percentage of test cases that passed across all runs. |
| Fail rate | Average percentage of test cases that failed across all runs. |
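The aggregate pass rate is the average of the per-run pass rates. A minimal sketch of that calculation, assuming a hypothetical per-run summary of `{"passed": int, "total": int}`:

```python
def pass_rate(runs: list[dict]) -> float:
    """Average fraction of passed test cases across all runs.
    Each run is {"passed": int, "total": int} (hypothetical shape)."""
    per_run = [r["passed"] / r["total"] for r in runs]
    return sum(per_run) / len(per_run)

runs = [
    {"passed": 8, "total": 10},   # run 1: 80% pass
    {"passed": 9, "total": 10},   # run 2: 90% pass
]
print(f"Pass rate: {pass_rate(runs):.0%}")      # Pass rate: 85%
print(f"Fail rate: {1 - pass_rate(runs):.0%}")  # Fail rate: 15%
```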
Review the results for each run to identify consistency issues. Each test run is displayed in a separate tab with an indicator showing whether the run passed or failed.
Next steps
- Update the AI agent: If the test results are not as expected, go to the agent in the AI agent builder to edit it and create a new version.
- Save the agent: Save the agent after testing passes.
Evaluation criteria
When evaluating agent behavior, check the following:
- Response quality - Is the text accurate, appropriate, and helpful?
- Tool usage - Was the correct tool or subagent called with the correct parameters?
- Validation steps - Did the agent process tool results correctly before taking further action?
For AI agents, correctness means both textual output and operational behavior. An agent that produces clear output but calls the wrong tool is still non-functional.
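The point that correctness spans both text and behavior can be expressed as a simple conjunction: all three criteria must hold. The function name and boolean inputs below are hypothetical, chosen only to illustrate the idea.

```python
def agent_is_correct(response_matches: bool, tool_calls_match: bool,
                     validated_results: bool) -> bool:
    """An agent run counts as correct only when the response text, the tool
    usage, and the validation steps all check out; clear output paired with
    the wrong tool call is still a failure."""
    return response_matches and tool_calls_match and validated_results

# Clear, helpful text, but the wrong tool was called:
print(agent_is_correct(response_matches=True, tool_calls_match=False,
                       validated_results=True))  # False
```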