Bootstrap your AI evals before you have a single user

You have an AI product idea. You have no users. You need a test set to know whether the thing works, and you cannot get a test set without users. So the project stalls at "looks fine in the demo."

This is the chicken-and-egg problem of AI development, and it has a one-file solution: generate the test set yourself, along the dimensions that matter for your product.

By the end of this post you will have working Gemini 2.0 Flash code that does three things end-to-end:

Generates diverse synthetic user queries across feature, scenario, and persona dimensions
Generates AI responses to those queries
Runs an LLM-as-judge over the pairs with few-shot expert critiques

Think in dimensions, not examples

The mistake most teams make is writing 20 test prompts by hand and calling it an eval set. Twenty hand-written prompts pattern-match too closely to whatever was on your mind the day you wrote them.

Generate along three dimensions instead:

Features — what capabilities the product needs to support
Scenarios — what situations the product will encounter
Personas — who is using it and how

For a physician scheduling assistant:

Features: appointment booking, rescheduling, cancellation, availability checking
Scenarios: routine scheduling, urgent request, conflicting appointment, after-hours request
Personas: primary care physician, specialist, nurse, admin staff, patient coordinator

The cross-product gives you 4 × 4 × 5 = 80 distinct test inputs from 13 dimension values. Every combination forces a different kind of query out of the generator.
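
The arithmetic is easy to check before spending any API calls. A minimal sketch using the example values above; the variable names are illustrative and not part of any library:

```python
import itertools

# Dimension values from the physician scheduling example above.
features = ["appointment booking", "rescheduling", "cancellation", "availability checking"]
scenarios = ["routine scheduling", "urgent request", "conflicting appointment", "after-hours request"]
personas = ["primary care physician", "specialist", "nurse", "admin staff", "patient coordinator"]

# Every (feature, scenario, persona) triple is one test input.
combinations = list(itertools.product(features, scenarios, personas))

dimension_values = len(features) + len(scenarios) + len(personas)
print(dimension_values)   # 13
print(len(combinations))  # 80 = 4 * 4 * 5
```

Adding one more persona adds 16 test inputs, not one, which is why the choice of values deserves some thought.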

The dimensions you pick on day one will outlive the product. Treat them as a design decision, not a quick list.

Generate queries, not answers

The rule: use the LLM to generate user inputs, not the expected outputs. If you let the generator write both sides, your eval inherits the generator's biases and tells you nothing about your own system.

Configure Gemini, then define one function that turns a dimension triple into a query:

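import itertools
import json
import os
from typing import Any, Dict, List

import google.generativeai as genai
import pandas as pd

# Assumption: the API key is stored in the GEMINI_API_KEY environment variable.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])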

def generate_synthetic_query(
    feature: str,
    scenario: str,
    persona: str,
    model_name: str = "gemini-2.0-flash"
) -> Dict[str, Any]:
    """Generate a realistic user query for the given dimension triple."""

    prompt = f"""
    Generate a realistic user query for a physician scheduling assistant.

    FEATURE: {feature}
    SCENARIO: {scenario}
    PERSONA: {persona}

    Guidelines:
    1. Natural language, as typed by a medical professional or staff member
    2. Include medical context and scheduling details that fit the persona
    3. Realistic: abbreviations, terminology, incomplete information where appropriate
    4. Urgent requests realistic, not extreme
    5. Fictional patient names and appointment types
    6. Output the query as plain text only. No disclaimers.
    """

    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)

    return {
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "query": response.text.strip(),
    }

Then iterate the cross-product:

def generate_synthetic_dataset(
    features: List[str],
    scenarios: List[str],
    personas: List[str],
    samples_per_combination: int = 1,
) -> pd.DataFrame:
    rows = []
    for feature, scenario, persona in itertools.product(features, scenarios, personas):
        for _ in range(samples_per_combination):
            rows.append(generate_synthetic_query(feature, scenario, persona))
    return pd.DataFrame(rows)

One sample per combination is enough to start. You will regret oversampling before you have looked at the first batch.

Generate the AI responses you actually want to evaluate

Once you have queries, run them through the system under test. Here the system is a single Gemini prompt — in your product it would be your actual agent, RAG pipeline, or workflow:

def generate_ai_response(
    query: str,
    feature: str,
    scenario: str,
    persona: str,
    model_name: str = "gemini-2.0-flash",
) -> str:
    prompt = f"""
    You are a physician scheduling assistant. Respond to the following user query.

    USER QUERY: "{query}"

    CONTEXT:
    - Feature: {feature}
    - Scenario: {scenario}
    - User role: {persona}

    SCHEDULING SYSTEM:
    - Slots: 9:00 AM - 5:00 PM, Monday-Friday
    - Durations: 15 min (quick follow-up), 30 min (standard), 60 min (new patient/complex)
    - Emergency slots: 11:30 AM and 3:30 PM daily
    - Current time: Monday, 10:00 AM

    Respond in under 150 words. No meta-commentary.
    """
    model = genai.GenerativeModel(model_name)
    return model.generate_content(prompt).text.strip()

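# Build the query set from the example dimensions, then attach a response to each row.
synthetic_data = generate_synthetic_dataset(
    features=["appointment booking", "rescheduling", "cancellation", "availability checking"],
    scenarios=["routine scheduling", "urgent request", "conflicting appointment", "after-hours request"],
    personas=["primary care physician", "specialist", "nurse", "admin staff", "patient coordinator"],
)
synthetic_data["ai_response"] = synthetic_data.apply(
    lambda row: generate_ai_response(
        row["query"], row["feature"], row["scenario"], row["persona"]
    ),
    axis=1,
)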

Now every row in the dataframe is a complete interaction: dimensions, query, response. That is the artefact you grade.

Judge with few-shot critiques, not a vibe rubric

A naive LLM judge — "rate this response from 1 to 5" — is noise. The judge that holds up uses binary pass/fail with expert critiques as few-shot examples. You write the critiques once, by hand, for three to five examples. Mix pass and fail. The model then has a calibrated sense of what each verdict means in your domain.

EXPERT_EXAMPLES = [
    {
        "query": "Book Mrs. Eleanor Vance for her annual physical with Dr. Ramirez next week, AM preferred. Also a BP follow-up for Mr. David Rossi two weeks out, any afternoon.",
        "response": "Scheduled Mrs. Vance next Tuesday 9:00 AM with Dr. Ramirez. Mr. Rossi booked two weeks from today at 2:00 PM. Both confirmed.",
        "critique": "Addresses both requests with specific times matching the stated preferences. Confirms scheduling. Missing: appointment duration and confirmation step before finalising.",
        "judgment": "PASS",
    },
    {
        "query": "Reschedule Sarah Johnson's cardiology follow-up from next Friday to the following week. She prefers mornings.",
        "response": "I'll look into rescheduling. What time was her original appointment on Friday?",
        "critique": "Asks for information the scheduling system already has. Should locate the existing appointment and propose specific morning slots for the following week.",
        "judgment": "FAIL",
    },
]

def evaluate_ai_response(
    query: str, response: str, feature: str, scenario: str, persona: str,
    model_name: str = "gemini-2.0-flash",
) -> dict:
    examples = "\n\n".join(
        f"USER QUERY: \"{e['query']}\"\nASSISTANT RESPONSE: \"{e['response']}\"\n"
        f"CRITIQUE: {e['critique']}\nJUDGMENT: {e['judgment']}"
        for e in EXPERT_EXAMPLES
    )

    prompt = f"""
    You are an expert evaluator for a physician scheduling assistant.

    {examples}

    Now evaluate:
    USER QUERY: "{query}"
    ASSISTANT RESPONSE: "{response}"
    CONTEXT: feature={feature}, scenario={scenario}, persona={persona}

    Write a critique, then a binary PASS or FAIL judgment.
    Output JSON: {{"critique": "...", "judgment": "PASS" or "FAIL", "improvement_suggestions": "..."}}
    """
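    model = genai.GenerativeModel(model_name)
    raw = model.generate_content(prompt).text
    # The model may wrap its JSON in a Markdown fence; keep only the outermost {...} object.
    return json.loads(raw[raw.find("{"): raw.rfind("}") + 1])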

Once this loop runs, you have a pass rate. That number is the start of an actual feedback loop — not a vanity metric you compute once.
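
Turning judge output into that number takes a few lines. A minimal sketch, assuming each judged row is a dict that carries its dimension values plus the judge's "judgment" field; the helper names pass_rate and pass_rate_by are hypothetical, and the sample rows are invented purely for illustration:

```python
from collections import defaultdict

def pass_rate(results):
    """Fraction of results whose judgment is PASS (case-insensitive)."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["judgment"].strip().upper() == "PASS")
    return passed / len(results)

def pass_rate_by(results, dimension):
    """Pass rate for each value of one dimension, e.g. 'feature'."""
    groups = defaultdict(list)
    for r in results:
        groups[r[dimension]].append(r)
    return {value: pass_rate(rows) for value, rows in groups.items()}

# Hypothetical judged rows, made up for illustration only.
results = [
    {"feature": "appointment booking", "judgment": "PASS"},
    {"feature": "appointment booking", "judgment": "PASS"},
    {"feature": "rescheduling", "judgment": "FAIL"},
    {"feature": "rescheduling", "judgment": "PASS"},
]

print(pass_rate(results))                # 0.75
print(pass_rate_by(results, "feature"))  # {'appointment booking': 1.0, 'rescheduling': 0.5}
```

The per-dimension breakdown is usually more useful than the headline number, because it points at the part of the product that is failing.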

What you will hit next

Three predictions for teams that adopt this on day one:

Your first 100 synthetic queries will sound like your prompt. The generator will pattern-match the phrasing in the instructions. The fix is persona diversity, not feature diversity — add real-sounding voices ("rushed admin who types in fragments", "specialist who dictates") as a fourth dimension and the queries stop sounding like a textbook.
Your LLM judge will agree with itself more than it agrees with you. Before you trust the pass rate, label 30 examples by hand and compare. If judge-vs-human agreement is under 80%, your few-shot critiques are not specific enough. Rewrite the critiques, not the rubric.
The dimensions you pick on day one will outlive the product. They become the schema your error analysis, your dashboards, and your regression tests all hang off. Pick them deliberately. Then, when you have real traffic, slice the failures along the same axes and you will know exactly which feature × scenario × persona cell is broken.
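
The judge-vs-human check in the second prediction can be sketched as a plain agreement rate. This assumes you hold two equal-length lists of "PASS"/"FAIL" labels for the same examples; the function names and the sample labels are hypothetical, and raw agreement does not correct for agreement by chance:

```python
def judge_human_agreement(judge_labels, human_labels):
    """Share of examples where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def disagreement_indices(judge_labels, human_labels):
    """Positions where the two labels differ; read these examples first."""
    return [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]

# Hypothetical labels for 10 examples, made up for illustration only.
human = ["PASS"] * 6 + ["FAIL"] * 4
judge = ["PASS"] * 9 + ["FAIL"] * 1

agreement = judge_human_agreement(judge, human)
print(agreement)                           # 0.7
print(agreement >= 0.80)                   # False: below the 80% bar from the text
print(disagreement_indices(judge, human))  # [6, 7, 8]
```

In this made-up sample every disagreement is a case the human failed and the judge passed, which is the pattern that more specific FAIL critiques are meant to fix.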

A production-grade loop adds dimension-based error analysis, agreement tracking against human labels, and iterative refinement of the critique examples. That is the next playbook. Get to a working pass rate first.

Start before you have users

The point of synthetic data is not to replace real users. It is to give you a measurable system to ship to those users. Without an eval, every prompt change is a guess. With one, you have a number that moves.

Pick three features, three scenarios, and three personas. Generate the 3 × 3 × 3 = 27 queries. Run them through your current pipeline. Hand-grade 10. You now have more signal than most AI products have at launch.

If you are building an AI product without users yet, send me your three features, three scenarios, and three personas, and I will tell you which combinations are missing from your eval set. paul@paulelliot.co.