Every interview comes back the same way: status Completed, a tidy summary underneath. The call where the expert refused every substantive question and the call where they handed you the whole market structure look identical in the list. Both ran their length, both produced a clean paragraph, both are ticked off.

At five interviews a week you don't have a problem — you read every transcript. At five hundred you can't, and "Completed" quietly stops meaning anything. The weak calls are in there, costing the same as the strong ones, and nothing surfaces them. This is the gap a quality signal has to close, and most attempts close it the wrong way.

Why a fixed rubric doesn't survive contact with real interviews

The obvious move is to write a checklist. Did the expert name competitors? Did the call run past ten minutes? Were there at least five exchanges? Score against that and rank.

It falls apart immediately, because "good" is not one thing. A competitive-landscape interview is meant to surface named vendors and switching costs; a clinical-practice interview is meant to surface treatment pathways and would be worse for naming a competitor. A pricing teardown wants numbers; a culture read wants narrative. The moment you serve more than one kind of study, any fixed rubric is either too generic to be useful or quietly penalises interviews for not doing something they were never meant to do.

The same problem defeats the cruder proxies. Call length rewards rambling. Word count rewards a chatty expert who never answered the question. None of these measure the thing you actually care about, which is whether this conversation achieved what this interview set out to achieve.

So that's the standard we score against. Each interview is judged against its own objective — the brief the interview was actually run to fulfil — rather than a universal idea of a good call. A screening interview is graded on whether it screened. A deep technical interview is graded on whether it got depth. There is no rubric to maintain per study and nothing to configure: the definition of "good" is read from what the interview was already set up to do.

Never just a number

A score on its own is an argument-starter. The interview owner sees a 38, disagrees, and now you're relitigating a number with no shared ground.

So the score never travels alone. Every one comes with a short, plain-language explanation of why — what the interview was after, what it got, and where it fell short. A low score reads like "the expert declined to discuss vendor selection and the call ended before pricing was covered," not just a red 38. That turns the score from a verdict into a starting point: an owner can glance at the reason, agree or override, and move on. It also keeps the metric honest. A score you can't explain is a score you shouldn't trust.

Multi-step interviews aren't an average

Plenty of interviews aren't a single block. A short screening or profiling step hands off to the main line of questioning; sometimes the conversation is routed to a specialist track partway through. If you score those naively, you average the parts — and a flawless two-minute screening step props up a main interview that went nowhere.

That's backwards. The screening did its job; the interview still failed. So the overall score is weighted toward substance: the parts of the conversation whose job was to gather insight drive the number, and the connective steps — greeting, profiling, routing — neither inflate nor deflate it. A perfect hand-off in front of a hollow interview still scores like a hollow interview. Where it's useful, you can also see the breakdown by step, so "the screening was fine, the main interview stalled" is visible at a glance rather than buried in an average.
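
To make the weighting concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how the scoring is actually computed: the step names, the `gathers_insight` flag, the 0-100 scale and the choice to weight by minutes spent are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str              # hypothetical step label, e.g. "screening"
    score: float           # assumed 0-100 score for this step
    minutes: float         # assumed weight: time spent in the step
    gathers_insight: bool  # True only for steps whose job is to gather insight


def overall_score(steps: List[Step]) -> Optional[float]:
    """Weighted mean over insight-gathering steps only.

    Connective steps (greeting, profiling, routing) are left out, so they
    can neither inflate nor deflate the result.
    """
    substantive = [s for s in steps if s.gathers_insight]
    total_weight = sum(s.minutes for s in substantive)
    if total_weight == 0:
        return None  # nothing substantive happened, so there is no score
    return sum(s.score * s.minutes for s in substantive) / total_weight


def naive_average(steps: List[Step]) -> float:
    """Plain mean of every step, shown only for comparison."""
    return sum(s.score for s in steps) / len(steps)


call = [
    Step("screening", score=100, minutes=2, gathers_insight=False),
    Step("main interview", score=20, minutes=25, gathers_insight=True),
]
print(naive_average(call))  # 60.0
print(overall_score(call))  # 20.0
```

The flawless screening step lifts the naive average to 60, while the substance-weighted figure stays at 20, which is what a hollow interview should score.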

A good interview and a useful one aren't the same thing

There's a deeper conflation hiding inside a single quality number — and it's the one that trips up what you do next. A score for "how much insight did this produce" quietly blends two things an operations team needs to keep apart: how well the AI interviewer ran the conversation, and how much the expert was willing or able to give it.

They come apart all the time. A sharp interviewer can probe perfectly and still come away with little, because the expert guarded the specifics or ended the call early. A generous, talkative expert can flatter an interviewer that never actually pushed. Two interviews can land on the same middling score for opposite reasons — one because the agent under-probed a strong expert, the other because the agent did everything right and a cautious expert simply wouldn't go deeper. The first is a fixable coaching problem; the second isn't a problem at all.

So we score them separately. Interview Quality is the outcome — how much usable insight the conversation produced. Agent Conduct is the craft — did the interviewer ask sharp questions, chase thin answers, follow up, and handle a refusal gracefully — judged independently of how forthcoming the expert turned out to be. A skilful interviewer who meets a guarded expert scores high on conduct and lower on quality; one that lets a willing expert off the hook scores the reverse.

The gap between the two is the part you act on. High conduct with low quality says the interview was run well and the expert or the topic capped what was there: accept it and move on. Low conduct says the interviewer left value on the table, which is a repeatable thing you can improve. One number tells you a call was weak; two tell you whose problem it is.

One more piece of context rides along. When a call didn't actually finish — the expert dropped, hung up, or cut it short before the substance — the interview is flagged as incomplete. A low score then reads as a call that never got its chance rather than an interviewer who fumbled a good one, which is the difference between sourcing a replacement and coaching the agent.
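
One way to picture how the two scores and the incomplete flag combine into a next step is the sketch below. The `Result` fields, the 0-100 scale, the `LOW` cut-off and the action labels are assumptions invented for the example; a real team would set its own thresholds and wording.

```python
from dataclasses import dataclass

LOW = 50  # hypothetical cut-off on an assumed 0-100 scale


@dataclass
class Result:
    quality: int      # Interview Quality: usable insight produced
    conduct: int      # Agent Conduct: how well the conversation was run
    incomplete: bool  # True if the call ended before the substance


def next_step(r: Result) -> str:
    """Map the two scores and the incomplete flag to a suggested follow-up."""
    if r.incomplete and r.quality < LOW:
        return "source a replacement"    # the call never got its chance
    if r.conduct < LOW:
        return "review the interviewer"  # value was left on the table
    if r.quality < LOW:
        return "accept and move on"      # run well, capped by expert or topic
    return "no action needed"


print(next_step(Result(quality=38, conduct=85, incomplete=False)))  # accept and move on
print(next_step(Result(quality=38, conduct=30, incomplete=False)))  # review the interviewer
print(next_step(Result(quality=20, conduct=80, incomplete=True)))   # source a replacement
```

The first two calls share the same quality score of 38 and still get different follow-ups, which is the case a single number cannot separate.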

What changes when quality is visible

Once every interview carries a score and a reason, the day-to-day shape of the work changes.

Triage instead of read-everything. You stop reading transcripts to find the duds and start with the ones the score flags. Sort the batch, pull everything below a threshold into one view, and spend your attention where it's warranted.
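
In code terms, triage is little more than a filter and a sort. The sketch below assumes each interview is a plain dictionary with hypothetical `id`, `score` and `reason` keys and a made-up threshold of 50.

```python
def triage(interviews, threshold=50):
    """Return the interviews scoring below `threshold`, weakest first."""
    flagged = [i for i in interviews if i["score"] < threshold]
    return sorted(flagged, key=lambda i: i["score"])


batch = [
    {"id": "a1", "score": 82, "reason": "covered vendors and switching costs"},
    {"id": "b2", "score": 38, "reason": "expert declined to discuss vendor selection"},
    {"id": "c3", "score": 12, "reason": "call ended before the core questions"},
]
for item in triage(batch):
    print(item["id"], item["score"], item["reason"])
# c3 12 call ended before the core questions
# b2 38 expert declined to discuss vendor selection
```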

A re-run decision you can defend. A weak score with a clear reason — refused the core questions, cut short, never reached the brief — is the evidence for sourcing a replacement, and the evidence for not wasting a slot re-running a call that was always going to be thin.

A feel for the study, not just the call. Scores in aggregate tell you whether a line of questioning is landing across experts or consistently falling flat — a signal about the interview design itself, visible long before you'd have spotted it by hand.
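
A rough sketch of that aggregate view, assuming you can export (topic, score) pairs for a study: the topic labels and numbers are invented, and the median is just one reasonable summary.

```python
from collections import defaultdict
from statistics import median


def median_by_topic(rows):
    """rows: iterable of (topic, score) pairs. Returns {topic: median score}."""
    grouped = defaultdict(list)
    for topic, score in rows:
        grouped[topic].append(score)
    return {topic: median(scores) for topic, scores in grouped.items()}


rows = [
    ("pricing", 30), ("pricing", 35), ("pricing", 40),
    ("switching costs", 70), ("switching costs", 80), ("switching costs", 90),
]
print(median_by_topic(rows))  # {'pricing': 35, 'switching costs': 80}
```

A topic that sits low across several experts points at the line of questioning rather than at any single call.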

None of this is about grading experts or replacing the judgement of the people who run the study. It's about giving an operations team a reliable, glanceable signal where there was previously only "Completed", so that at five hundred interviews a week, the word still means something.

That's the whole point of running interviews intelligently: not just getting the calls done, but knowing which of them delivered.

A Completed Interview Isn't the Same as a Useful One

