The Evaluator block uses AI to score and assess content quality against custom metrics. Perfect for quality control, A/B testing, and ensuring AI outputs meet specific standards.

Configuration Options
Evaluation Metrics
Define custom metrics to evaluate content against. Each metric includes:
- Name: A short identifier for the metric
- Description: A detailed explanation of what the metric measures
- Range: The numeric range for scoring (e.g., 1-5, 0-10)
Example metrics:
Accuracy (1-5): How factually accurate is the content?
Clarity (1-5): How clear and understandable is the content?
Relevance (1-5): How relevant is the content to the original query?
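As a sketch, a metric set like the one above could be represented as an array of objects; the field names below are illustrative, not the block's actual configuration schema.

```typescript
// Illustrative shape for evaluation metrics (field names are assumptions,
// not the Evaluator block's actual configuration schema).
interface EvaluationMetric {
  name: string;                        // short identifier, e.g. "Accuracy"
  description: string;                 // what the metric measures
  range: { min: number; max: number }; // scoring range, e.g. 1-5
}

const metrics: EvaluationMetric[] = [
  { name: "Accuracy", description: "How factually accurate is the content?", range: { min: 1, max: 5 } },
  { name: "Clarity", description: "How clear and understandable is the content?", range: { min: 1, max: 5 } },
  { name: "Relevance", description: "How relevant is the content to the original query?", range: { min: 1, max: 5 } },
];
```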
Content
The content to be evaluated. This can be:
- Directly provided in the block configuration
- Connected from another block's output (typically an Agent block)
- Dynamically generated during workflow execution
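As a sketch of the second option, assuming the block's configuration is a plain object and that upstream outputs are referenced with the same angle-bracket syntax used for Evaluator outputs below, wiring in an Agent block's output might look like this:

```typescript
// Hypothetical configuration object; the field names and the
// "<agent.content>" reference syntax are assumptions for illustration.
const evaluatorConfig = {
  content: "<agent.content>",  // output of an upstream Agent block
  metrics,                     // the metric definitions sketched above
  model: "claude-sonnet-4.5",  // placeholder model identifier
};
```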
Model Selection
Choose an AI model to perform the evaluation:
- OpenAI: GPT-4o, o1, o3, o4-mini, GPT-4.1
- Anthropic: Claude Sonnet 4.5
- Google: Gemini 2.5 Pro, Gemini 2.0 Flash
- Other Providers: Groq, Cerebras, xAI, DeepSeek
- Local Models: Ollama- or vLLM-compatible models
Use models with strong reasoning capabilities like GPT-4o or Claude Sonnet 4.5 for best results.
API Key
Your API key for the selected LLM provider. This is securely stored and used for authentication.
Example Use Cases
Content Quality Assessment - Evaluate content before publication
Agent (Generate) → Evaluator (Score) → Condition (Check threshold) → Publish or Revise
A/B Testing Content - Compare multiple AI-generated responses
Parallel (Variations) → Evaluator (Score Each) → Function (Select Best) → Response
Customer Support Quality Control - Ensure responses meet quality standards
Agent (Support Response) → Evaluator (Score) → Function (Log) → Condition (Review if Low)
Outputs
- <evaluator.content>: Summary of the evaluation with scores
- <evaluator.model>: Model used for evaluation
- <evaluator.tokens>: Token usage statistics
- <evaluator.cost>: Estimated evaluation cost
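For instance, a downstream Function block (or any custom code step) could read these outputs and gate publication on a minimum score. The output shapes and threshold below are illustrative assumptions:

```typescript
// Assumed shape of the Evaluator outputs for this sketch; the real fields
// are listed above, but their exact structure may differ.
interface EvaluatorOutputs {
  content: Record<string, number>; // e.g. { accuracy: 4, clarity: 5, relevance: 3 }
  model: string;
  tokens: { prompt: number; completion: number; total: number };
  cost: { input: number; output: number; total: number };
}

// Publish only if every metric scores at or above the threshold.
function meetsThreshold(outputs: EvaluatorOutputs, minScore = 4): boolean {
  return Object.values(outputs.content).every((score) => score >= minScore);
}
```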
Best Practices
- Use specific metric descriptions: Clearly define what each metric measures to get more accurate evaluations
- Choose appropriate ranges: Select scoring ranges that provide enough granularity without being overly complex
- Connect with Agent blocks: Use Evaluator blocks to assess Agent block outputs and create feedback loops
- Use consistent metrics: For comparative analysis, maintain consistent metrics across similar evaluations
- Combine multiple metrics: Use several metrics to get a comprehensive evaluation
Common Questions
What format do evaluation scores come back in?
The Evaluator returns a JSON object where each key is the lowercase version of your metric name and the value is a numeric score within the range you defined. For example, a metric named 'Accuracy' with range 1-5 would appear as { "accuracy": 4 } in the output.
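For the three example metrics defined earlier, the scored output would look roughly like this (values are illustrative):

```typescript
// Keys are the lowercased metric names; values fall within each metric's range.
const evaluation = { accuracy: 4, clarity: 5, relevance: 3 };
```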
Which model should I use for evaluations?
Models with strong reasoning capabilities produce the most consistent evaluations. GPT-4o and Claude Sonnet 4.5 are recommended; the default model is Claude Sonnet 4.5.
What kinds of content can the Evaluator assess?
The content field accepts any string input. If you pass JSON or structured data, the Evaluator will automatically detect and format it before evaluation. However, the evaluation is text-based, so it cannot directly evaluate images or audio.
What happens if a metric is missing a name or range?
Metrics missing a name or range are automatically filtered out. The Evaluator only scores metrics that have both a valid name and a defined min/max range.
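A minimal sketch of that filtering rule, not the block's actual implementation:

```typescript
type Metric = { name?: string; description?: string; range?: { min: number; max: number } };

// Keep only metrics that have both a non-empty name and a defined min/max range.
function validMetrics(candidates: Metric[]): Metric[] {
  return candidates.filter(
    (m) =>
      typeof m.name === "string" &&
      m.name.trim().length > 0 &&
      m.range !== undefined &&
      Number.isFinite(m.range.min) &&
      Number.isFinite(m.range.max)
  );
}
```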
Is the model forced to return only numeric scores?
Yes. The Evaluator generates a JSON Schema response format based on your metrics and enforces strict mode, so the LLM is constrained to return only the expected metric scores as numbers, with no extra text or explanations.
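As an illustration, a strict schema derived from the Accuracy and Clarity metrics could look roughly like the following; the exact structure the block generates may differ:

```typescript
// Roughly what a strict, metrics-derived JSON Schema might look like.
const responseSchema = {
  type: "object",
  properties: {
    accuracy: { type: "number", minimum: 1, maximum: 5 },
    clarity: { type: "number", minimum: 1, maximum: 5 },
  },
  required: ["accuracy", "clarity"],
  additionalProperties: false, // strict mode: no extra keys or free-form text
};
```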
How is evaluation cost calculated?
Costs are based on the token usage of the underlying LLM call. The Evaluator outputs include token counts (prompt, completion, total) and a cost breakdown (input, output, total) so you can track spending per evaluation.
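A back-of-the-envelope sketch of how per-evaluation cost follows from token counts; the per-million-token prices are placeholders, not actual provider pricing:

```typescript
// Placeholder prices per million tokens; substitute your provider's real rates.
const PRICE_PER_MILLION = { input: 3.0, output: 15.0 };

function evaluationCost(tokens: { prompt: number; completion: number }) {
  const input = (tokens.prompt / 1_000_000) * PRICE_PER_MILLION.input;
  const output = (tokens.completion / 1_000_000) * PRICE_PER_MILLION.output;
  return { input, output, total: input + output };
}

// e.g. evaluationCost({ prompt: 1200, completion: 80 })
```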