Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array) |
| `expected_output` | No | Expected response for comparison (string, object, or message array) |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test assertion evaluators (alternative to `execution.evaluators`) |
| `sidecar` | No | Additional metadata passed to evaluators |

The simplest form of `input` is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
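
The string-to-message expansion is straightforward to picture. Here is a minimal sketch of how a harness might normalize the `input` field (a hypothetical helper illustrating the documented behavior, not the framework's actual code):

```python
def normalize_input(value):
    """Expand a bare string into a single user message; pass message arrays through."""
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value

# A bare string becomes one user message:
print(normalize_input("What is 15 + 27?"))

# A message array is left as-is:
history = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 15 + 27?"},
]
print(normalize_input(history))
```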

`expected_output` is an optional reference response that evaluators can compare against. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```

Per-case evaluators are merged with root-level `execution.evaluators`: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level evaluators for a specific test, set `skip_defaults: true`:

```yaml
execution:
  evaluators:
    - name: latency_check
      type: latency
      threshold: 5000
tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level evaluators
  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
      evaluators:
        - name: custom_eval
          type: llm_judge
    # Does NOT get latency_check
```
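
The merge behavior above can be sketched in a few lines of Python (an illustration of the documented semantics, not the framework's implementation):

```python
def merge_evaluators(test_evaluators, root_evaluators, skip_defaults=False):
    """Test-specific evaluators run first; root-level defaults are appended,
    unless the test opts out with skip_defaults: true."""
    if skip_defaults:
        return list(test_evaluators)
    return list(test_evaluators) + list(root_evaluators)

root = [{"name": "latency_check", "type": "latency", "threshold": 5000}]

# normal-case: no per-test evaluators, so the root-level default applies
print(merge_evaluators([], root))

# special-case: skip_defaults drops latency_check entirely
print(merge_evaluators([{"name": "custom_eval", "type": "llm_judge"}], root,
                       skip_defaults=True))
```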

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.
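
The replace-not-merge rule can be illustrated with a small sketch (assuming top-level workspace fields are replaced wholesale, as the example above shows for `setup`):

```python
def resolve_workspace(suite_ws, test_ws):
    """Test-level workspace fields replace suite-level ones wholesale;
    fields the test does not set are inherited from the suite."""
    merged = dict(suite_ws or {})
    merged.update(test_ws or {})
    return merged

suite = {"setup": {"script": ["bun", "run", "default-setup.ts"]}}

# case-1 overrides setup; case-2 sets nothing and inherits it
print(resolve_workspace(suite, {"setup": {"script": ["bun", "run", "custom-setup.ts"]}}))
print(resolve_workspace(suite, None))
```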

Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
workspace:
  setup:
    script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
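
A setup script can read that payload from stdin. A minimal sketch in Python (only the `case_metadata` key is assumed here; the payload may carry other keys as well):

```python
import io
import json

def read_case_metadata(stream):
    """Parse the stdin JSON a setup/teardown script receives and
    return its case_metadata block (empty dict if absent)."""
    payload = json.load(stream)
    return payload.get("case_metadata", {})

# Simulated stdin, shaped like the sympy-20590 example above:
stdin = io.StringIO('{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}')
meta = read_case_metadata(stdin)
print(meta["repo"], meta["base_commit"])
```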

The `assert` field provides a concise way to attach evaluators directly to a test, as an alternative to nesting them inside `execution.evaluators`. It supports both deterministic assertion types and LLM-based rubric evaluation.

Deterministic assertion evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Field | Description |
| --- | --- | --- |
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |

```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
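
Judging from the examples `contains-DENIED` and `is_json`, the generated name appears to combine the assertion type and its value. A plausible sketch of that scheme (an assumption based on the examples, not confirmed framework code):

```python
def auto_name(assertion):
    """Derive a default evaluator name from an assertion's type and value,
    producing names like "contains-DENIED" or "is_json"."""
    if "name" in assertion:
        return assertion["name"]
    if "value" in assertion:
        return f"{assertion['type']}-{assertion['value']}"
    return assertion["type"]

print(auto_name({"type": "contains", "value": "DENIED"}))  # contains-DENIED
print(auto_name({"type": "is_json"}))                      # is_json
```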

Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:

```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```

Any evaluator (in `assert` or `execution.evaluators`) can be marked as required. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.

| Value | Behavior |
| --- | --- |
| `required: true` | Must score >= 0.8 (the default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to fail.
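
The gating rule reduces to a simple check over the scored results. A sketch of the documented semantics (illustrative, not the framework's implementation):

```python
def apply_required_gates(results):
    """results: list of (score, required) pairs, where required is False,
    True (default threshold 0.8), or a float threshold between 0 and 1.
    Any required evaluator below its threshold forces a fail verdict."""
    for score, required in results:
        if required is False:
            continue
        threshold = 0.8 if required is True else required
        if score < threshold:
            return "fail"
    return "pass"

print(apply_required_gates([(1.0, True), (0.7, 0.6)]))   # pass: both gates met
print(apply_required_gates([(0.5, True), (1.0, False)])) # fail: required gate missed
```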

`assert` and `execution.evaluators` serve the same purpose: both define evaluators for a test. Choose whichever fits your style:

- `assert`: flat, concise syntax at the test level. Best for assertion-heavy evals.
- `execution.evaluators`: nested under `execution`. Useful when you also set `target` or `skip_defaults`.

When both are present on the same test, `assert` takes precedence over `execution.evaluators`. Suite-level `assert` and `execution.evaluators` follow the same merge rules: they are appended to per-test evaluators unless `skip_defaults: true` is set.

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```