Verbalised Sampling: a simple way to recover model variety

Reinforcement Learning from Human Feedback (RLHF) has a quiet side effect that matters more than people admit: it nudges models towards the safest, most typical answer.

That is fine if you want clean prose. It is a problem if you want range, originality, or a less obvious line of reasoning.

The reason is simple. Human annotators usually reward text that feels familiar, fluent, and “reasonable”. So over time, the model learns that the middle of the distribution is safer than the edges. The result is not just bland writing. It is often mode collapse: the model keeps returning the most average version of a response, even when the prompt could support much more variety.

Verbalised Sampling (VS) is a neat workaround because it does not need retraining.

Instead of asking for one answer, you ask the model to generate several candidate responses and attach an estimated probability to each. That small change matters. It pushes the model to surface more of the distribution it learned in pre-training, rather than only the single response that alignment has made most rewarding.
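As a rough illustration, the prompt change can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `build_vs_prompt` and `parse_candidates`, the prompt wording, and the JSON reply format are all assumptions made for this example, and the call to the model itself is left out because it depends on your provider.

```python
import json


def build_vs_prompt(task: str, k: int = 3) -> str:
    """Wrap a task in a VS-style instruction (hypothetical wording)."""
    if k < 2:
        raise ValueError("k must be at least 2; one candidate is just a normal prompt")
    return (
        f"Task: {task}\n\n"
        f"Generate {k} distinct candidate responses to the task above. "
        "For each one, estimate the probability that you would give it as your answer.\n"
        'Reply with only a JSON array of objects with the keys "text" and "probability".'
    )


def parse_candidates(raw: str) -> list:
    """Parse the model's JSON reply into a list of {"text", "probability"} dicts."""
    items = json.loads(raw)
    candidates = []
    for item in items:
        candidates.append(
            {"text": str(item["text"]), "probability": float(item["probability"])}
        )
    return candidates


# Usage: send `prompt` to your model, then parse whatever text comes back.
prompt = build_vs_prompt("Write an opening line for an article on mode collapse")
# The reply below is made-up sample data, not real model output.
sample_reply = '[{"text": "Option A", "probability": 0.6}, {"text": "Option B", "probability": 0.4}]'
candidates = parse_candidates(sample_reply)
```

Asking for structured output is a design choice, not a requirement of the technique. It simply makes the candidates easier to compare later. A real reply may not be valid JSON, so wrap the parsing step in error handling before relying on it.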

For example, if you ask a model:
“Give me three ways to position this product for Indian enterprise buyers, with probability estimates for each,”
VS may surface one conservative option, one price-led option, and one trust-led option. A plain RLHF-trained response often collapses to the safest general statement. VS makes the alternatives visible.

Another example: if the prompt is about debugging a flaky production issue, a standard answer may jump straight to logging, retries, or timeouts. With VS, you may get a more unusual but valid line of thought alongside the obvious one — say, state contamination, dependency drift, or a hidden rollout issue — with the model showing its confidence in each path.

A third example is creative writing. Ask for three opening lines for the same article, each with probabilities. The usual aligned model tends to produce one polished but familiar opener. VS can reveal a more direct line, a more analytical one, and one that is slightly unexpected but stronger.

In practice, that can recover answers that are:

  • more creative
  • more logically varied
  • less over-smoothed
  • less locked into the “safe” middle

The trade-off is that the model now exposes its uncertainty instead of hiding it behind a single confident answer, and the probabilities it reports are estimates, not measurements. That is useful, but only if you are prepared to sort, compare, and choose. VS is not a magic fix. It is a way of getting past the alignment filter when the default answer is too conservative to be useful.
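The sorting and choosing step can be sketched as follows. This is a minimal sketch under stated assumptions: candidates are dicts with "text" and "probability" keys (a format chosen for this example), the helper names `normalise`, `rank`, and `sample` are hypothetical, and the numbers in the usage section are made up for illustration. Because the reported probabilities are the model's own estimates, they may not sum to 1, so the sketch rescales them first.

```python
import random


def normalise(candidates):
    """Rescale reported probabilities so they sum to 1 (negatives count as 0)."""
    clipped = [max(c["probability"], 0.0) for c in candidates]
    total = sum(clipped)
    if total <= 0:
        # No usable estimates: fall back to treating every candidate equally.
        return [{**c, "probability": 1.0 / len(candidates)} for c in candidates]
    return [{**c, "probability": p / total} for c, p in zip(candidates, clipped)]


def rank(candidates):
    """Order candidates from most to least likely, for side-by-side review."""
    return sorted(normalise(candidates), key=lambda c: c["probability"], reverse=True)


def sample(candidates, rng=None):
    """Draw one candidate in proportion to its rescaled probability."""
    rng = rng or random.Random()
    scaled = normalise(candidates)
    weights = [c["probability"] for c in scaled]
    return rng.choices(scaled, weights=weights, k=1)[0]


# Usage with made-up numbers for the product-positioning example.
options = [
    {"text": "trust-led", "probability": 0.3},
    {"text": "conservative", "probability": 0.6},
    {"text": "price-led", "probability": 0.3},
]
ordered = rank(options)
pick = sample(options, rng=random.Random(0))
```

Ranking suits cases where a person reviews every option. Sampling suits cases where you want variety across many runs, because lower-probability candidates still get chosen some of the time.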

For teams building with LLMs, that distinction matters. Sometimes the best answer is not the most polished one. It is the one that was suppressed by optimisation in the first place.
