Tests

Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array) |
| `expected_output` | No | Expected response for comparison (string, object, or message array) |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test assertion evaluators (alternative to `execution.evaluators`) |
| `sidecar` | No | Additional metadata passed to evaluators |

The simplest form of `input` is a string, which expands to a single user message:

```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:

```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```
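
The string-to-message expansion is straightforward to picture. Here is a minimal sketch of how a harness might normalize the `input` field (a hypothetical helper illustrating the documented behavior, not the framework's actual code):

```python
def normalize_input(value):
    """Expand a bare string into a single user message; pass message arrays through."""
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value

# A bare string becomes one user message:
print(normalize_input("What is 15 + 27?"))

# A message array is left as-is:
history = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 15 + 27?"},
]
print(normalize_input(history))
```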

`expected_output` is an optional reference response that evaluators can compare against. A string expands to a single assistant message:

```yaml
expected_output: "42"
```

For structured or multi-message expected output, use a message array:

```yaml
expected_output:
  - role: assistant
    content: "42"
```

Override the default target or evaluators for specific tests:

```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```

Per-case evaluators are merged with root-level `execution.evaluators`: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level evaluators for a specific test, set `skip_defaults: true`:

```yaml
execution:
  evaluators:
    - name: latency_check
      type: latency
      threshold: 5000
tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level evaluators
  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
      evaluators:
        - name: custom_eval
          type: llm_judge
    # Does NOT get latency_check
```
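
The merge behavior above can be sketched in a few lines of Python (an illustration of the documented semantics, not the framework's implementation):

```python
def merge_evaluators(test_evaluators, root_evaluators, skip_defaults=False):
    """Test-specific evaluators run first; root-level defaults are appended,
    unless the test opts out with skip_defaults: true."""
    if skip_defaults:
        return list(test_evaluators)
    return list(test_evaluators) + list(root_evaluators)

root = [{"name": "latency_check", "type": "latency", "threshold": 5000}]

# normal-case: no per-test evaluators, so the root-level default applies
print(merge_evaluators([], root))

# special-case: skip_defaults drops latency_check entirely
print(merge_evaluators([{"name": "custom_eval", "type": "llm_judge"}], root,
                       skip_defaults=True))
```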

Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:

```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]
tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]
  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.
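
The replace-not-merge rule can be illustrated with a small sketch (assuming top-level workspace fields are replaced wholesale, as the example above shows for `setup`):

```python
def resolve_workspace(suite_ws, test_ws):
    """Test-level workspace fields replace suite-level ones wholesale;
    fields the test does not set are inherited from the suite."""
    merged = dict(suite_ws or {})
    merged.update(test_ws or {})
    return merged

suite = {"setup": {"script": ["bun", "run", "default-setup.ts"]}}

# case-1 overrides setup; case-2 sets nothing and inherits it
print(resolve_workspace(suite, {"setup": {"script": ["bun", "run", "custom-setup.ts"]}}))
print(resolve_workspace(suite, None))
```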

Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:

```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
workspace:
  setup:
    script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
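
A setup script can read that payload from stdin. A minimal sketch in Python (only the `case_metadata` key is assumed here; the payload may carry other keys as well):

```python
import io
import json

def read_case_metadata(stream):
    """Parse the stdin JSON a setup/teardown script receives and
    return its case_metadata block (empty dict if absent)."""
    payload = json.load(stream)
    return payload.get("case_metadata", {})

# Simulated stdin, shaped like the sympy-20590 example above:
stdin = io.StringIO('{"case_metadata": {"repo": "sympy/sympy", "base_commit": "abc123def"}}')
meta = read_case_metadata(stdin)
print(meta["repo"], meta["base_commit"])
```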

The `assert` field provides a concise way to attach evaluators directly to a test, as an alternative to nesting them inside `execution.evaluators`. It supports both deterministic assertion types and LLM-based rubric evaluation.

Deterministic assertion evaluators run without an LLM call and produce binary (0 or 1) scores:

| Type | Field | Description |
| --- | --- | --- |
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |

```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
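
Judging from the examples `contains-DENIED` and `is_json`, the generated name appears to combine the assertion type and its value. A plausible sketch of that scheme (an assumption based on the examples, not confirmed framework code):

```python
def auto_name(assertion):
    """Derive a default evaluator name from an assertion's type and value,
    producing names like "contains-DENIED" or "is_json"."""
    if "name" in assertion:
        return assertion["name"]
    if "value" in assertion:
        return f"{assertion['type']}-{assertion['value']}"
    return assertion["type"]

print(auto_name({"type": "contains", "value": "DENIED"}))  # contains-DENIED
print(auto_name({"type": "is_json"}))                      # is_json
```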

Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:

```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```

Any evaluator (in `assert` or `execution.evaluators`) can be marked as required. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.

| Value | Behavior |
| --- | --- |
| `required: true` | Must score >= 0.8 (the default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to fail.
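
The gating rule reduces to a simple check over the scored results. A sketch of the documented semantics (illustrative, not the framework's implementation):

```python
def apply_required_gates(results):
    """results: list of (score, required) pairs, where required is False,
    True (default threshold 0.8), or a float threshold between 0 and 1.
    Any required evaluator below its threshold forces a fail verdict."""
    for score, required in results:
        if required is False:
            continue
        threshold = 0.8 if required is True else required
        if score < threshold:
            return "fail"
    return "pass"

print(apply_required_gates([(1.0, True), (0.7, 0.6)]))   # pass: both gates met
print(apply_required_gates([(0.5, True), (1.0, False)])) # fail: required gate missed
```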

`assert` and `execution.evaluators` serve the same purpose: both define evaluators for a test. Choose whichever fits your style:

- `assert`: flat, concise syntax at the test level. Best for assertion-heavy evals.
- `execution.evaluators`: nested under `execution`. Useful when you also set `target` or `skip_defaults`.

When both are present on the same test, `assert` takes precedence over `execution.evaluators`. Suite-level `assert` and `execution.evaluators` follow the same merge rules: they are appended to per-test evaluators unless `skip_defaults: true` is set.

Pass additional context to evaluators via the `sidecar` field:

```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```