Stop debugging your LLM by re-reading transcripts

You shipped an LLM product. Engagement is climbing. Then a support ticket lands with a screenshot of a reply you cannot explain, and the only forensic tool you have is scrolling the chat transcript.

That is the moment most teams realise their product is a black box. Five tracing moves fix it. The first one is one line of structured logging at the model boundary — a JSON blob with the prompt hash, the model id, the latency, and a session id — and it alone will resolve more incidents than the next prompt rewrite you are tempted to ship.

By the end of this post you will know:

The five log surfaces that make an LLM debuggable on day one
The metadata fields you regret not adding the first time you skip them
Where human review and automated evals fit, and where they do not
Three predictions about what your tracing setup will look like in six weeks

Why tracing, not logging

Server logs tell you the request happened. Tracing tells you what the model saw, what it decided, and what state the rest of the system was in when it decided. Those are different questions, and you cannot answer the second one from the first.

A probabilistic system does not fail like a deterministic one. The reply on the screenshot is the last thing that happened. Every input that shaped it — the rendered prompt, the retrieved context, the route, the user's prior turn — is sitting in memory at call time and gone the moment the response ships. Tracing is the discipline of writing those inputs down before they disappear.

1. Log the full call, not the reply

Capture every model call as a structured record. At minimum:

{
  "session_id": "s_8f21",
  "turn": 4,
  "model": "claude-sonnet-4-5",
  "prompt_template_id": "router.v3",
  "rendered_prompt_hash": "a1b2c3",
  "input_tokens": 1842,
  "output_tokens": 211,
  "latency_ms": 1320,
  "output": "..."
}

The two fields that earn their keep first are prompt_template_id and rendered_prompt_hash. The template id tells you which version of your prompt ran. The hash tells you whether two calls actually saw the same prompt, which is the question you will be asked the first time a regression appears after a deploy. A hash only lets you compare prompts, so if you want to replay a call, also store the rendered prompt itself somewhere you can look it up by that hash.
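To make the record above concrete, here is a minimal Python sketch of a wrapper that builds it at the model boundary. The function name build_call_record, the call_model parameter, and the fake_model stand-in are all hypothetical, and truncating a SHA-256 digest to 12 characters is just one reasonable choice. Token counts are left out because they normally come from the provider's response, which this sketch does not model.

```python
import hashlib
import json
import time


def build_call_record(session_id, turn, model, template_id, rendered_prompt, call_model):
    """Call the model once and return a structured record of that call."""
    # Hash the exact text the model saw so two calls can be compared later.
    prompt_hash = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()[:12]
    start = time.perf_counter()
    output = call_model(rendered_prompt)
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "session_id": session_id,
        "turn": turn,
        "model": model,
        "prompt_template_id": template_id,
        "rendered_prompt_hash": prompt_hash,
        "latency_ms": latency_ms,
        "output": output,
    }


def fake_model(prompt):
    # Hypothetical stand-in for a real client call.
    return "ok: " + prompt[:20]


record = build_call_record(
    "s_8f21", 4, "claude-sonnet-4-5", "router.v3",
    "Route this request: cancel my plan", fake_model,
)
print(json.dumps(record))  # one JSON line per model call
```

Because the hash is computed from the rendered prompt rather than the template, two calls on the same template with different retrieved context will produce different hashes, which is exactly the distinction you need after a deploy.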

Tools like LangSmith handle the capture side. What goes into each record is the decision you own.

2. Attach metadata that lets you slice traffic

A trace is only useful if you can group it. Attach metadata at the call site, not after the fact:

user_cohort — free, paid, internal, evaluation
route — which branch of your workflow chose this call
experiment — flag name plus variant
client_version — so a client-side bug does not look like a model regression
parent_trace_id — for multi-call workflows where one user turn produces three model calls

You do not need to know which slice will matter. You need every slice to be available the next time someone asks "did this only happen on the new mobile build?"

The field nobody asks for is usually the one that resolves the next incident. Add it once. Move on.
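One way to enforce "at the call site, not after the fact" is to refuse to emit a record that lacks the slicing fields. The sketch below is an assumption about how that could look, not a feature of any particular tracing tool: with_metadata, new_trace_id, and the trace_id field are hypothetical names, and the record is a plain dict.

```python
import uuid

# Fields every record must carry before it is written (taken from the list above).
REQUIRED_METADATA = ("user_cohort", "route", "experiment", "client_version")


def new_trace_id():
    return "t_" + uuid.uuid4().hex[:8]


def with_metadata(record, metadata, parent_trace_id=None):
    """Return a copy of record tagged with metadata, failing loudly if a field is missing."""
    missing = [key for key in REQUIRED_METADATA if key not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    tagged = dict(record)
    tagged.update(metadata)
    tagged["trace_id"] = new_trace_id()
    tagged["parent_trace_id"] = parent_trace_id
    return tagged


# One user turn that fans out into three model calls shares one parent id.
meta = {
    "user_cohort": "paid",
    "experiment": "router_rewrite:B",
    "client_version": "ios-5.2.0",
}
parent = new_trace_id()
calls = [
    with_metadata({"output": "..."}, {**meta, "route": route}, parent_trace_id=parent)
    for route in ("classify", "retrieve", "answer")
]
```

Grouping on parent_trace_id then reconstructs the whole turn, and filtering on client_version answers the "new mobile build" question directly.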

3. Tie traces to outcomes the business cares about

A trace without an outcome is a souvenir. Join your call records to whatever downstream signal you already have: did the user reply, did they complete the workflow, did they churn, did a support ticket get opened.

The join key is usually session_id. The join itself usually lives in your warehouse, not in your tracing tool. Braintrust and similar platforms make this easier, but the discipline is what matters: every model output should be addressable by the business event it caused or failed to cause.

This is what lets you stop arguing about which prompt change felt better and start measuring which one moved the metric you ship against.
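The join described above usually lives in SQL in the warehouse, but its shape is easy to show in a few lines of Python. This is a sketch with made-up data: the completed_workflow field and the function name are hypothetical, and each session is counted once per template so that a chatty session does not outweigh a short one.

```python
from collections import defaultdict


def completion_rate_by_template(calls, outcomes):
    """Join calls to outcomes on session_id and return completion rate per template."""
    seen = set()
    totals = defaultdict(lambda: [0, 0])  # template -> [completed, sessions]
    for call in calls:
        key = (call["session_id"], call["prompt_template_id"])
        if key in seen:
            continue  # count each session once per template
        seen.add(key)
        outcome = outcomes.get(call["session_id"])
        if outcome is None:
            continue  # no downstream signal recorded for this session
        totals[call["prompt_template_id"]][1] += 1
        if outcome["completed_workflow"]:
            totals[call["prompt_template_id"]][0] += 1
    return {t: done / total for t, (done, total) in totals.items()}


calls = [
    {"session_id": "s_1", "prompt_template_id": "router.v2"},
    {"session_id": "s_1", "prompt_template_id": "router.v2"},
    {"session_id": "s_2", "prompt_template_id": "router.v2"},
    {"session_id": "s_3", "prompt_template_id": "router.v3"},
    {"session_id": "s_4", "prompt_template_id": "router.v3"},
]
outcomes = {
    "s_1": {"completed_workflow": True},
    "s_2": {"completed_workflow": False},
    "s_3": {"completed_workflow": True},
    "s_4": {"completed_workflow": True},
}
rates = completion_rate_by_template(calls, outcomes)
print(rates)
```

With real traffic you would also want sample sizes next to each rate, since a template that ran on four sessions proves very little.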

4. Make human review cheap

Automated evals do not catch the failure modes you have not named yet. Human review does. The work is to make the review path short:

A queue of traces filtered by metadata — "show me last week's route=cancel calls on the new template"
One-click labels: correct, wrong, ambiguous, plus a free-text note
A path from a labelled trace back into your eval set

If labelling a trace takes more than ten seconds, nobody will do it after the first week. Build the interface before you need it, not after the first incident.
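The three pieces above (a filtered queue, a fixed label set, a path back into the eval set) can be prototyped before any interface exists. The following is a minimal sketch over in-memory dicts. All function names are hypothetical, and a real queue would also filter by time window, which is omitted here.

```python
VALID_LABELS = {"correct", "wrong", "ambiguous"}


def review_queue(traces, **filters):
    """Return the traces whose fields equal every given filter value."""
    return [t for t in traces if all(t.get(k) == v for k, v in filters.items())]


def label_trace(trace, label, note=""):
    """Attach a reviewer label and optional free-text note to a copy of the trace."""
    if label not in VALID_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return {**trace, "label": label, "note": note}


def to_eval_case(labelled):
    """Turn a labelled trace into a row for the eval set."""
    return {
        "prompt_template_id": labelled["prompt_template_id"],
        "rendered_prompt_hash": labelled["rendered_prompt_hash"],
        "output": labelled["output"],
        "expected_label": labelled["label"],
    }


traces = [
    {"route": "cancel", "prompt_template_id": "router.v3",
     "rendered_prompt_hash": "a1b2c3", "output": "..."},
    {"route": "billing", "prompt_template_id": "router.v3",
     "rendered_prompt_hash": "d4e5f6", "output": "..."},
    {"route": "cancel", "prompt_template_id": "router.v2",
     "rendered_prompt_hash": "0718aa", "output": "..."},
]
queue = review_queue(traces, route="cancel", prompt_template_id="router.v3")
labelled = label_trace(queue[0], "wrong", note="offered a refund instead of cancelling")
eval_set = [to_eval_case(labelled)]
```

Keeping the label set closed is what makes the labels countable later; the free-text note carries everything that does not fit.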

5. Use LLM-as-judge, but validate it

Automated evals scale review. They also lie quietly when the judge prompt drifts from human judgement. Two rules:

Anchor the judge to a labelled set. Periodically score the same traces with the judge and a human. Track agreement. If agreement drops, suspect the judge prompt before you conclude the product has regressed.
Keep judges binary. "Did this reply answer the user's question, yes or no" usually beats "rate this reply 1 to 5". A binary judge is auditable. A 1-to-5 score is a vibe.

The judge is not a replacement for human review. It is a way to scan a million traces for the hundred that humans should look at.
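Tracking agreement between a binary judge and human labels needs very little code. The sketch below computes raw agreement and, as an addition the post does not mention, Cohen's kappa, which corrects for the agreement two raters would reach by chance. The label lists are invented for illustration.

```python
def agreement_rate(human, judge):
    """Fraction of traces where the judge and the human gave the same yes/no label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement for two binary raters."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    p_human_yes = sum(human) / n
    p_judge_yes = sum(judge) / n
    # Probability that both say yes or both say no by chance alone.
    p_expected = p_human_yes * p_judge_yes + (1 - p_human_yes) * (1 - p_judge_yes)
    if p_expected == 1.0:
        return 1.0  # both raters gave a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# True means "the reply answered the user's question".
human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, False, True, True, True]
print(agreement_rate(human, judge), cohens_kappa(human, judge))
```

Raw agreement can look healthy when almost every reply is a "yes", which is why a chance-corrected number is worth logging alongside it. Whatever threshold triggers a judge review is a choice you have to make for your own data.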

What you will hit next

Three predictions for any team that turns this on:

Your first tracing dashboard will be unreadable. A view that pleases engineers — token counts, latency percentiles, cache hit rates — does not help the support engineer who needs to find one session by user email. Build a second view for them or they will keep paging you.
You will discover a prompt regression you shipped six weeks ago that nobody noticed. The trace volume on the affected route was too low to show up in aggregate metrics, but the failure rate inside that route was high. Tracing is how that becomes visible. Fix the regression, then add the alert.
The metadata field you almost did not add will solve the next incident. It is usually client_version, experiment, or parent_trace_id. Add them now. They cost almost nothing in storage and a great deal in forensics when they are missing.
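The second prediction is worth seeing as arithmetic. In the sketch below, the route names and counts are invented, and the failed flag is assumed to come from human labels or a validated judge. A route with little traffic and a high failure rate barely moves the overall number.

```python
from collections import Counter


def failure_rates(traces):
    """Return the overall failure rate and the failure rate per route."""
    totals = Counter()
    failures = Counter()
    for trace in traces:
        totals[trace["route"]] += 1
        if trace["failed"]:
            failures[trace["route"]] += 1
    overall = sum(failures.values()) / sum(totals.values())
    by_route = {route: failures[route] / totals[route] for route in totals}
    return overall, by_route


# 980 calls on a busy route with 10 failures, 20 calls on a quiet route with 12.
traces = (
    [{"route": "faq", "failed": i < 10} for i in range(980)]
    + [{"route": "cancel", "failed": i < 12} for i in range(20)]
)
overall, by_route = failure_rates(traces)
print(overall, by_route["cancel"])
```

Here the overall failure rate is 22 out of 1000, about 2 percent, while the cancel route fails 12 times out of 20. An alert on the aggregate would stay quiet; an alert per route would not.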

How to start this week

Pick one workflow. Add the structured log line from section 1 at every model call inside it. Attach three metadata fields: session_id, route, prompt_template_id. Pipe it to a tool that lets you query by any of them.

You do not need to solve evals, judges, or dashboards on day one. You need every model call in that one workflow to be addressable, replayable, and joinable to an outcome. The rest of this list extends from there.

The shift is small and the payoff is structural: the next time a ticket lands, you stop re-reading the transcript and start reading the trace.

If you are debugging an LLM product by scrolling transcripts right now, send me one structured log line from your current setup and I will tell you which two fields you are missing that will save your next incident. paul@paulelliot.co.