· 14 min · Ilyas Baba

Reduce LLM Hallucinations: 7 Techniques That Ship

7 techniques to cut LLM hallucinations in production sales agents, ranked by ROI. Vectara's HHEM benchmark shows 1.3%-15.8% rates across frontier models.


If you ship a B2B sales agent that hallucinates one prospect’s job title, you bought a refund. If it hallucinates a customer’s purchase history into a cold email, you bought a churn event. For production agent builders, hallucinations are not a research curiosity; they are a unit-economics problem. This guide ranks the 7 techniques that actually move the needle in production, with the cost-to-impact tradeoffs, and explains why the famous “multi-agent debate” paper from Du et al. is the wrong first move for a B2B sales agent.


TL;DR

Decision rule: ground first, verify second, debate almost never. Multi-agent debate is the last technique you should add, not the first.

The 7 techniques, ranked by ROI for production sales agents:

  1. Grounding via retrieval (RAG done right) — biggest single drop in hallucination rate.
  2. Tool-use loops with verification — turn “what’s the email?” into a tool call, not a guess.
  3. Few-shot examples in the system prompt — cheap, fast, surprisingly effective.
  4. Confidence calibration via log probs — know when the model is bluffing.
  5. Constitutional AI / output filtering — catch the regression before it ships.
  6. Human-in-the-loop on destructive actions — last-mile insurance.
  7. Eval-driven prompt iteration — the only technique that compounds.

Frontier model hallucination rates on summarization vary from 1.3% to 15.8% depending on the model, per the Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard, 2024. Architecture choices matter more than model choice past a certain threshold.


Why “multi-agent debate” is overrated for sales agents

The original Du et al. “Improving Factuality and Reasoning in Language Models through Multiagent Debate” paper, 2023, out of MIT and Google, showed measurable accuracy gains on math and reasoning benchmarks when three model instances argued before producing a final answer. The paper is rigorous. The conclusion is real. The production applicability for B2B sales agents is much weaker than the citation count suggests.

Three reasons debate fails in production sales. First, latency triples. A 4-second response becomes 12 seconds, which kills conversational UX in chat and breaks real-time outreach loops. Second, token cost triples at minimum, often 5x once you add a moderator. Third, the agreement-driven failure mode is real: when all three model instances share training data, they confidently agree on the same hallucination. Debate amplifies consensus, not truth.

Debate has its place. Use it for offline batch analysis where latency does not matter and the cost of being wrong is high. For real-time sales workflows, the other six techniques on this list give you better factuality per dollar.


Technique 1: Grounding via retrieval (RAG basics done right)

Retrieval-augmented generation is the single highest-leverage hallucination fix you can ship. A well-tuned RAG layer can reduce hallucination rates on factual queries by 71% on average, per Galileo’s “RAG Hallucination Benchmark” report, 2024. The catch: most RAG implementations in the wild are not well-tuned. They retrieve the wrong chunks, then the model dutifully generates something that sounds plausible.

Three things separate the production RAG that works from the demo RAG that does not.

Chunk your data semantically, not by token count

Splitting documents into 512-token chunks is the easiest path and the worst output. A CRM note about a prospect gets split in the middle of “they signed up in March of last year for the trial, then expanded to” and the agent retrieves only the first half. Use semantic chunking, paragraph-aware splits, or structured chunks per CRM field. Your retrieval recall jumps without touching the model.
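A minimal sketch of a paragraph-aware split, assuming paragraphs are separated by blank lines. The function name and the `max_chars` budget are illustrative, not taken from any particular library.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks; never cut inside a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the chunk first.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


note = (
    "They signed up in March of last year for the trial.\n\n"
    "They then expanded to a second team in the autumn.\n\n"
    "Renewal conversation is scheduled for next quarter."
)
chunks = chunk_by_paragraph(note, max_chars=110)
```

A paragraph longer than `max_chars` is kept whole as its own chunk, which trades a few oversized chunks for never cutting a sentence in half.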

Use hybrid search, not pure vector

Vector search alone misses exact-match queries on names, IDs, and dollar amounts. Hybrid search (BM25 plus dense retrieval, with reciprocal rank fusion) catches both fuzzy semantic queries and exact-match lookups. Anthropic’s “Contextual Retrieval” research, Sep 2024, showed that combining contextual embeddings with contextual BM25 reduced retrieval failures by 49% compared with plain embeddings.
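Reciprocal rank fusion itself is only a few lines. The sketch below assumes you already have two ranked lists of chunk IDs, one from BM25 and one from the dense retriever. The IDs are made up, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["crm-17", "crm-04", "crm-22"]   # exact-match friendly
dense_hits = ["crm-04", "crm-31", "crm-17"]  # semantic neighbours
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Chunks that appear in both lists rise to the top, which is the behaviour you want when a query mixes a name with a fuzzy intent.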

Cite the source in the response

Force the model to include the chunk ID or document URL in its output. This does two things: it gives you a verifiable trail for QA, and it makes the model more conservative because it has to point to where the claim came from. If it cannot cite, it should say “I do not have that information” instead of inventing it.
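One way to enforce this is a post-generation check that every cited chunk ID was actually retrieved. The `[chunk:ID]` citation format below is an assumption for illustration; use whatever marker your prompt asks the model to emit.

```python
import re

# Hypothetical citation marker, e.g. "[chunk:crm-17]".
CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag answers that cite nothing, or cite chunks that were not retrieved."""
    cited = set(CITATION.findall(answer))
    unknown = cited - retrieved_ids
    return {
        "cited": sorted(cited),
        "unknown": sorted(unknown),
        "ok": bool(cited) and not unknown,
    }


retrieved = {"crm-17", "crm-04"}
good = check_citations("Acme expanded in March [chunk:crm-17].", retrieved)
uncited = check_citations("Acme has 500 seats.", retrieved)
invented = check_citations("Acme renewed in May [chunk:crm-99].", retrieved)
```

An answer that fails the check can be regenerated or replaced with the “I do not have that information” fallback.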


Technique 2: Tool-use loops with verification

If the model needs a fact, do not let it guess. Give it a tool that fetches the fact. A 2024 OpenAI evaluation on tool-augmented agents showed structured function calling reduced fabricated entity rates significantly compared to pure prompt-based extraction.

The pattern for a B2B sales agent looks like this:

User: "Send a follow-up to Sarah at Acme about the Q4 demo."

Naive agent: drafts email, hallucinates Sarah's last name and email address.

Tool-using agent:
 → call lookup_contact({company: "Acme", first_name: "Sarah"})
 ← returns [{name: "Sarah Chen", email: "[email protected]", last_contact: "..."}]
 → draft email with verified data
 → call verify_email_deliverable("[email protected]")
 ← returns {deliverable: true, mx_record_valid: true}
 → send

The verification step is the part most builders skip. After the tool returns data, run a quick second LLM pass that checks: did the response use only fields that the tool actually returned, or did it add fields the tool did not include? If it added unverified fields, regenerate. The research literature usually files this under groundedness or faithfulness checking; it is distinct from self-consistency, which samples several answers and keeps the majority.

The cost is one extra LLM call per turn. The benefit is killing entire classes of hallucination at the source.
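Before paying for the second LLM pass, a deterministic pre-check can catch the most damaging case cheaply. This sketch is a swapped-in simplification, not the LLM check described above: it only verifies that every email address in the draft came back from a tool call. The record shape and the addresses are hypothetical.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def unverified_emails(draft: str, tool_records: list[dict]) -> list[str]:
    """Return email addresses in the draft that no tool call returned."""
    known = {str(value).lower() for rec in tool_records for value in rec.values()}
    found = {addr.lower() for addr in EMAIL.findall(draft)}
    return sorted(found - known)


records = [{"name": "Sarah Chen", "email": "sarah.chen@acme.example"}]
draft = "To: sarah.chen@acme.example, cc s.chen@acme.example. Hi Sarah, ..."
problems = unverified_emails(draft, records)
```

A non-empty result means the draft should be regenerated before the deliverability check or the send step runs.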


Technique 3: Few-shot examples in the system prompt

Few-shot prompting is unfashionable in the era of zero-shot frontier models. It is also one of the most cost-effective hallucination reducers you can deploy. Microsoft Research’s “In-Context Learning” survey, 2023 catalogues the substantial accuracy gains from well-chosen examples, and the gains are largest precisely on edge-case factual queries.

For a sales agent, your few-shot examples should cover the failure modes you actually see in production. Three categories matter most.

Refusal patterns

Show the model what “I do not know” looks like in your domain. Example:

User: "What's the budget approval process at Acme?"
Assistant: "I don't have visibility into Acme's internal budget process.
I can pull the data we have on their last purchase order if helpful."

Without this example, the model invents an approval chain. With it, the model defaults to honesty.

Tool-call patterns

Show the exact format of the tools you want the agent to use, with one positive and one negative example. Models trained on broad function-calling traces sometimes regress to plausible-but-wrong tool argument structures. A single in-context demonstration locks the schema in.

Output format patterns

If the agent should produce JSON with specific fields, demonstrate it. If the agent should email in your brand voice, paste an example email that is exactly the tone you want, with all signature elements. The model imitates examples far more reliably than it follows abstract instructions.

Keep your few-shot block under 1,500 tokens for cost efficiency. Rotate examples weekly based on your eval suite. Examples are an asset, not a one-time write.


Technique 4: Confidence calibration via log probs

Most production agents act on every model output with equal confidence. They should not. Modern API providers expose token-level log probabilities that tell you how sure the model was about each token it generated. Use them.

OpenAI’s API exposes logprobs on the chat completion endpoint. Anthropic’s Messages API did not expose token log probabilities at the time of writing, so with Claude you need a proxy signal, such as a self-reported confidence field in a structured tool-use output. The pattern: after generation, compute the geometric mean of token probabilities across the answer span. Below a threshold (calibrate per use case, typically 0.7 to 0.85), trigger one of three actions:

  1. Retry with a different temperature or a stronger model.
  2. Escalate to a human-in-the-loop step.
  3. Refuse with an “I am not confident enough to answer” response.

A 2023 paper from DeepMind on calibration in LLMs showed that selective prediction (refusing low-confidence outputs) substantially improves the precision of model responses at the cost of coverage. For a sales agent, this tradeoff is exactly the one you want: better to skip a low-confidence reply than to send a hallucinated one to a paying prospect.

Two operational notes. Log probs are computed at the token level, so a single low-probability token in a long answer can drag your aggregate score down for no good reason. Use a percentile cutoff (e.g., 10th percentile probability) rather than the mean for a more stable signal. And set the threshold per task type: a creative subject-line draft tolerates more uncertainty than a CRM data lookup.
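Both aggregates are easy to compute once you have the per-token log probabilities as a list of floats. The sketch below assumes you have already extracted them from the provider response. The function names, the nearest-rank percentile, and the 0.7 threshold are illustrative choices.

```python
import math


def confidence_scores(token_logprobs: list[float], percentile: float = 10.0) -> dict[str, float]:
    """Aggregate token log probabilities into two confidence signals."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    # Geometric mean of probabilities == exp(mean of log probabilities).
    geo_mean = math.exp(sum(token_logprobs) / len(token_logprobs))
    probs = sorted(math.exp(lp) for lp in token_logprobs)
    # Nearest-rank percentile over the sorted token probabilities.
    rank = max(1, math.ceil(percentile * len(probs) / 100.0))
    return {"geo_mean": geo_mean, "low_percentile": probs[rank - 1]}


def route(score: float, threshold: float) -> str:
    """Act on confident answers; hand everything else to retry or a human."""
    return "act" if score >= threshold else "escalate"


# 19 confident tokens and one very unlikely token in a 20-token answer.
logprobs = [math.log(0.95)] * 19 + [math.log(0.001)]
scores = confidence_scores(logprobs)
by_mean = route(scores["geo_mean"], threshold=0.7)
by_percentile = route(scores["low_percentile"], threshold=0.7)
```

In this example the single outlier token pulls the geometric mean below the threshold while the 10th-percentile score stays high, so the two aggregates route the same answer differently.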


Technique 5: Constitutional AI / output filtering

Anthropic’s “Constitutional AI” research, 2022, introduced the pattern of having a model critique and revise its own output against a written set of rules, originally as a training technique. Applied at runtime as a second pass over generated content, the same idea is called output filtering or a guardrail layer, and it is one of the cheapest insurance policies you can buy.

For a sales agent, a useful constitution is short and specific:

  • The response must not contain dollar amounts unless they were returned by a tool call in this turn.
  • The response must not name a person unless that person appears in the retrieved context.
  • The response must not promise a deliverable date.
  • The response must not claim a product capability beyond a fixed approved list.

Run each generated response through a fast classifier (a small LLM call, or a fine-tuned classifier model) that checks compliance. If the response violates a rule, regenerate or escalate.

The overhead is one cheap model call per response, typically under 200ms with a small fast model like Claude Haiku, GPT-4o mini, or a hosted Llama 3 8B. The catch is calibration: write your rules in plain English with explicit examples of what counts as a violation. Vague rules (“be honest”) produce inconsistent filtering. Specific rules (“do not state a price that is not in the retrieved context”) catch the bugs that ship.
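Some rules do not need a model at all. As a sketch of the first rule in the constitution above, the check below flags any dollar amount in the response that did not appear in this turn’s tool output. It assumes tool output is available as raw strings, and the regex only covers simple `$1,234.56`-style amounts.

```python
import re
from decimal import Decimal

DOLLAR = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")


def normalize(amount: str) -> Decimal:
    """'$12,000.00' and '$12000' compare equal after normalization."""
    return Decimal(re.sub(r"[$,\s]", "", amount))


def ungrounded_amounts(response: str, tool_outputs: list[str]) -> list[str]:
    """Dollar amounts stated in the response but absent from tool output."""
    allowed = {normalize(m) for out in tool_outputs for m in DOLLAR.findall(out)}
    return [m for m in DOLLAR.findall(response) if normalize(m) not in allowed]


tool_outputs = ['{"last_po_total": "$12,000.00"}']
response = "Your renewal is $12,000 per year, plus a $500 onboarding fee."
violations = ungrounded_amounts(response, tool_outputs)
```

Deterministic rules like this one can run before the classifier call, so the model-based filter only sees responses that pass the cheap checks.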


Technique 6: Human-in-the-loop on destructive actions

Some actions an agent takes are reversible. Drafting an email is reversible up to the moment you press send. Updating a CRM note is mostly reversible. Sending a cold message to a named-account VP is not, in any meaningful business sense, reversible.

The pattern is binary: classify every action your agent can take as reversible or destructive, and require human approval for the destructive class. For B2B sales, the destructive list typically includes outbound messages to named accounts, calendar invitations to external recipients, CRM updates that change deal stage or amount, and any spend over a configured threshold.

The approval UX matters more than the policy. If approvals are buried in an email queue, the human bottleneck kills the agent’s throughput, and operators disable the human-in-the-loop step within two weeks. Build the approval surface where the operator already lives, typically Slack, with one-click approve, edit, or kill. Auto-approve after a configurable window for low-stakes accounts. Hard-block for VIP segments.

A practical implementation lesson worth flagging: instrument the approval rate as a first-class metric alongside conversion. If your approval rate is above 95% on every action type, the human is rubber-stamping and you should expand the autonomous surface. If it is below 70%, the agent is producing too many low-quality outputs and you have a deeper problem at one of the earlier layers (data, retrieval, or prompting).
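The classification and the approval-rate rule of thumb fit in a few lines. The action names and the spend limit below are hypothetical; the 95% and 70% cutoffs are the ones from the paragraph above.

```python
# Hypothetical action names; map these to your own agent's tool registry.
DESTRUCTIVE = {
    "send_outbound_message",
    "send_calendar_invite",
    "update_deal_stage",
    "update_deal_amount",
}


def needs_approval(action: str, spend: float = 0.0, spend_limit: float = 100.0) -> bool:
    """Destructive actions and spend over the limit require a human."""
    return action in DESTRUCTIVE or spend > spend_limit


def approval_signal(approved: int, total: int) -> str:
    """Interpret the approval rate for one action type."""
    if total == 0:
        return "no data"
    rate = approved / total
    if rate > 0.95:
        return "expand autonomy"      # the human is rubber-stamping
    if rate < 0.70:
        return "fix upstream layers"  # data, retrieval, or prompting
    return "keep current policy"
```

Tracking `approval_signal` per action type, rather than globally, keeps one noisy action from hiding a healthy one.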

For an applied version of this pattern in production, see our walkthrough on reducing no-shows with an AI agent and Twilio, where the human-in-the-loop checkpoint sits between the agent’s draft and the outbound SMS.


Technique 7: Eval-driven prompt iteration

You cannot improve what you do not measure. An eval suite is the only technique on this list that compounds over time. Every other technique gives you a one-time improvement. Evals give you a flywheel.

Hamel Husain’s “Your AI Product Needs Evals” essay, 2024 makes the case better than anyone: build the eval suite before you ship the prompt, and treat every production failure as a new test case. The discipline is more important than the tooling.

A minimum eval suite for a sales agent has three layers.

Layer 1: Unit-test-style cases

20 to 50 deterministic prompts with expected outputs, run on every prompt change. Examples: “When the user asks for a price that is not in context, the agent refuses.” “When the lookup_contact tool returns no results, the agent does not invent a fallback contact.”
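A minimal harness for this layer might look like the following. `stub_agent` is a hypothetical stand-in for your real agent call, and the substring assertions are deliberately crude; the point is that the cases are data and run on every prompt change.

```python
from typing import Callable


def stub_agent(prompt: str, context: dict) -> str:
    """Hypothetical stand-in for the real agent; replace with your own call."""
    if "price" in prompt.lower() and "price" not in context:
        return "I don't have that information."
    return f"The price is {context.get('price', 'unknown')}."


CASES = [
    {
        "name": "refuses_missing_price",
        "prompt": "What is the price for Acme?",
        "context": {},
        "must_contain": "don't have",
        "must_not_contain": "$",
    },
    {
        "name": "uses_price_from_context",
        "prompt": "What is the price for Acme?",
        "context": {"price": "$40 per seat"},
        "must_contain": "$40 per seat",
        "must_not_contain": "don't have",
    },
]


def run_cases(agent: Callable[[str, dict], str], cases: list[dict]) -> list[str]:
    """Return the names of failing cases; an empty list means all passed."""
    failures = []
    for case in cases:
        output = agent(case["prompt"], case["context"])
        if case["must_contain"] not in output or case["must_not_contain"] in output:
            failures.append(case["name"])
    return failures


failures = run_cases(stub_agent, CASES)
```

Every production failure becomes one more dictionary in `CASES`, which is how the suite grows without extra tooling.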

Layer 2: LLM-as-judge on production samples

Sample 1% of production responses daily and run them through a stronger model (Claude Sonnet, GPT-4 class) with a structured rubric: “Does this response cite a verified source for every factual claim? Score 1 to 5 with reasoning.” Track the score over time. A regression here is a signal that your retrieval or prompting has drifted.
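For the sampling step, hashing the response ID gives a stable sample without storing any state. This is a sketch under the assumption that each response has a unique string ID; the function name is illustrative.

```python
import hashlib


def in_daily_sample(response_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of responses by hashing the ID."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = [f"resp-{i}" for i in range(20000) if in_daily_sample(f"resp-{i}")]
```

Because the decision depends only on the ID, re-running the job selects the same responses, which makes judge scores reproducible when you debug a regression.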

Layer 3: Human-labeled golden set

50 to 100 hand-labeled responses, refreshed quarterly, with explicit pass-or-fail criteria. This is the slow, expensive, irreplaceable layer. Use it to validate that LLM-as-judge is itself calibrated.

The cost is real: you will spend engineering time you would rather spend on features. The benefit is that you can change prompts, swap models, and refactor agent loops without fear. Without evals, you are guessing whether the change you just shipped made things better or worse, and the answer in production is usually “worse, but only on the cases you do not look at.”


How to stack them in production

The 7 techniques are not a menu where you pick one. They are layers, and they compound. The recommended stacking order, by ROI per engineering week:

  1. Week 1: ship grounding (Technique 1) and tool-use loops (Technique 2). This eliminates the worst hallucination class (made-up facts) at the source.
  2. Week 2: add few-shot examples (Technique 3) and a basic eval suite (Technique 7, layer 1). Now you can measure progress.
  3. Week 3: layer in output filtering (Technique 5) for your domain’s specific failure modes.
  4. Week 4 onward: add confidence calibration (Technique 4) for selective prediction, and human-in-the-loop checkpoints (Technique 6) on destructive actions. Expand evals (Technique 7, layers 2 and 3).
  5. Only if needed: consider multi-agent debate for offline analysis tasks where latency is not a constraint.

The order matters because each technique covers a different failure mode. Grounding fixes “made-up facts.” Tool-use fixes “made-up entity attributes.” Few-shot fixes “wrong output format.” Calibration fixes “confident bluffing.” Filtering fixes “policy violations slip through.” Human-in-the-loop fixes “destructive actions on hallucinated context.” Evals fix “we cannot tell if we are improving.”

Skip a layer and you accept the corresponding failure mode in production. Some teams do this knowingly (the cost-quality tradeoff is real). Most teams do it unknowingly, and then ship an agent that hallucinates in front of a prospect.


Where Tasmela applies these techniques

Tasmela provisions a dedicated AI agent for each customer that runs on a per-tenant Hetzner instance, with the OpenClaw agent stack handling tool-use loops, integration-level grounding (CRM, LinkedIn via a compliant relay, Google Workspace, Slack), and human-in-the-loop checkpoints on outbound messages. The LLM is swappable at runtime via OpenRouter, which lets you route low-stakes tasks to cheaper models and high-stakes tasks to frontier models without rewriting your agent. Calibration thresholds and few-shot examples are configured per instance.

The product is not a frameworks-and-glue toolkit; it is a configured stack that ships five of the techniques out of the box (grounding, tool-use loops, few-shot examples, calibration thresholds, and human-in-the-loop checkpoints) and lets you add evals and custom output filters on top. If you are evaluating whether to build or buy, the /tarifs page has the credit structure and the integration matrix.


FAQ

What is the difference between hallucination and bullshit in an LLM context?

In the Hicks, Humphries and Slater paper “ChatGPT is bullshit”, 2024, the authors argue LLM errors are better described as bullshit (indifference to truth) than hallucination (false perception). For builders, the distinction matters: hallucination implies the model can be “cured,” bullshit implies you need external grounding to inject truth. The 7 techniques in this guide all assume the bullshit framing: do not trust the model to know what is true, give it access to what is true.

Do reasoning models like o1 and Claude 3.7 hallucinate less?

Mixed evidence. Per Vectara’s HHEM leaderboard, 2024, reasoning models can show higher hallucination rates on summarization tasks than their non-reasoning counterparts. The reasoning helps on logic and math, but does not reliably reduce confabulation on entity-level facts. Treat reasoning models as an upgrade for some tasks, not a fix for hallucinations.

Is fine-tuning a good way to reduce hallucinations?

Rarely, and usually for the wrong reasons. Fine-tuning adds knowledge into the weights, which means errors get baked in. Per OpenAI’s documentation on fine-tuning best practices, fine-tuning is better for behavior shaping (tone, format, refusal patterns) than for factual knowledge injection. For facts, use RAG.

How do I measure hallucination rate in production?

Sample, label, and track. Pull 100 production responses per week, label each as “factual,” “refused appropriately,” or “hallucinated,” and track the hallucination rate over time. Tools like Galileo and Patronus AI offer hosted evaluation pipelines, but a Google Sheet plus a weekly review meeting is enough for the first 6 months. The discipline matters more than the platform.
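The weekly number is a one-liner once the labels exist. The sketch below assumes the three labels named above, spelled as snake_case strings, and rejects anything else so typos in the sheet do not skew the rate.

```python
from collections import Counter

LABELS = {"factual", "refused_appropriately", "hallucinated"}


def hallucination_rate(labels: list[str]) -> float:
    """Share of labeled responses marked as hallucinated."""
    if not labels:
        raise ValueError("no labeled responses")
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return Counter(labels)["hallucinated"] / len(labels)


week = ["factual"] * 88 + ["refused_appropriately"] * 8 + ["hallucinated"] * 4
rate = hallucination_rate(week)
```

Appropriate refusals stay in the denominator, so an agent that learns to refuse instead of inventing shows up as a lower rate.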

When is multi-agent debate actually worth it?

Use debate for offline analysis where latency is not a constraint and the cost of error is high: due-diligence reports, contract analysis, post-call summaries graded by 3 separate model runs. For real-time conversational agents in sales, the other 6 techniques give you better factuality per dollar and per second.

