Grounding pipeline

How SLAtech avoids hallucination

A 12-stage RAG-based grounding pipeline with a structured citation system. Every response is grounded in tenant content retrieved at query time, not in a fine-tuned model. Visitors see citation snippets for each source. Responses are confidence-scored, logged to an audit trail, and fed into continuous evaluation. Read this alongside the architecture overview, eval scoreboard, and AI ethics statement.

1. Tenant content ingestion

Documents (PDF, DOCX, scraped HTML, FAQ pairs, manually authored articles) are chunked into 200-500 token segments with a 50-token overlap. Each chunk gets metadata: ClientId (tenant partition), sourceUrl, chunkIndex, lastUpdated.

Technical: Chunking algorithm respects document structure (paragraph boundaries) when possible. Overlap prevents context-loss across chunk boundaries.
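
A minimal sketch of paragraph-aware chunking with overlap, for illustration only. It assumes tokens can be approximated by whitespace-separated words (the production tokenizer is not described here), and the function name chunk_text is hypothetical. It enforces only the upper size bound, not the 200-token minimum.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split text into chunks of at most max_tokens tokens.

    Assumption: a "token" is a whitespace-separated word.
    Chunks close at paragraph boundaries when possible, and each new
    chunk starts with the last `overlap` tokens of the previous one.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: List[List[str]] = []
    current: List[str] = []
    fresh = 0  # tokens in `current` that are not carried-over overlap

    def flush() -> None:
        nonlocal current, fresh
        if fresh:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
            fresh = 0

    for para in paragraphs:
        # Prefer closing the chunk at a paragraph boundary.
        if fresh and len(current) + len(para) > max_tokens:
            flush()
        for tok in para:
            # Oversized paragraphs are split mid-paragraph as a last resort.
            if len(current) >= max_tokens:
                flush()
            current.append(tok)
            fresh += 1
    flush()
    return chunks
```

With three four-word paragraphs, max_tokens=6 and overlap=2, each chunk after the first begins with the final two words of the chunk before it.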

2. Embedding generation

OpenAI text-embedding-3-small (1536 dimensions) converts each chunk into a semantic vector. Embeddings are stored in Qdrant with the chunk's metadata.

Technical: Same embedding model used for queries — ensures cosine-similarity comparisons are semantically valid.
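
For reference, cosine similarity between two vectors can be sketched as below. This is a plain-Python illustration, not the vector database's implementation. The length check hints at why both sides must come from the same model: vectors from different models may differ in dimension, and even when dimensions match they live in unrelated spaces.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)
```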

3. Query embedding

When a visitor submits a message, the bot embeds the query using the same OpenAI model. The search is filtered by ClientId before retrieval, so it only ever sees the tenant's own chunks; cross-tenant contamination is structurally impossible.

Technical: Repository pattern enforces ClientId filter at compile time via static analyzer rule (SLATECH001).
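
The idea of a repository that cannot be queried without a tenant filter can be sketched as follows. This is a hypothetical in-memory illustration: the class and field names are assumptions, and Python can only show the runtime half of the guarantee, not the compile-time analyzer rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    client_id: str  # tenant partition key (ClientId in the text above)
    source_url: str
    text: str


class TenantScopedStore:
    """Hypothetical wrapper: every read is bound to exactly one tenant."""

    def __init__(self, chunks: List[Chunk], client_id: str) -> None:
        if not client_id:
            raise ValueError("client_id is required")
        self._chunks = chunks
        self._client_id = client_id

    def candidates(self) -> List[Chunk]:
        # The tenant filter is applied here and cannot be bypassed by callers.
        return [c for c in self._chunks if c.client_id == self._client_id]
```

Because the only read path goes through candidates(), a caller has no method that returns another tenant's chunks.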

4. Top-K retrieval

Qdrant returns the top-K (default 10) chunks with the highest cosine similarity to the query. The default ScoreThreshold of 0.5 filters out low-relevance chunks.

Technical: TopK is clamped to [1, 20] in per-tenant configuration. Queries whose results all fall below the threshold route to a "no relevant content" fallback rather than hallucinating.
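
The clamp-and-threshold logic can be sketched over already-scored hits. This is an illustration, not the Qdrant call itself: select_top_k is a hypothetical name, and treating the threshold as inclusive (score >= 0.5) is an assumption.

```python
from typing import List, Tuple

ScoredChunk = Tuple[str, float]  # (chunk id, cosine similarity)


def select_top_k(
    scored: List[ScoredChunk], top_k: int = 10, score_threshold: float = 0.5
) -> List[ScoredChunk]:
    """Keep the best-scoring chunks, honoring the TopK clamp and threshold."""
    k = max(1, min(20, top_k))  # clamp TopK to [1, 20]
    # Assumption: a score exactly at the threshold is kept.
    kept = [(cid, s) for cid, s in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

An empty result is the signal to use the "no relevant content" fallback instead of calling the LLM.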

5. Context assembly

Retrieved chunks, the system prompt, and conversation history are passed to the LLM. The system prompt explicitly instructs the LLM: "Answer only from the provided context. If the answer is not in the context, say so."

Technical: A token budget is enforced (default 4000 tokens of context); if the assembled context exceeds the budget, the lowest-scoring chunks are dropped first.
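
One way to apply such a budget is sketched below. It assumes the budget covers only the retrieved chunks (whether the system prompt and history count against it is not stated here) and that each chunk's token count is already known. The name fit_to_budget is hypothetical.

```python
from typing import List, Tuple

BudgetedChunk = Tuple[str, float, int]  # (chunk id, score, token count)


def fit_to_budget(chunks: List[BudgetedChunk], budget: int = 4000) -> List[BudgetedChunk]:
    """Drop the lowest-scoring chunks until the total fits the budget."""
    kept = sorted(chunks, key=lambda c: c[1], reverse=True)
    total = sum(c[2] for c in kept)
    while kept and total > budget:
        dropped = kept.pop()  # last element has the lowest score
        total -= dropped[2]
    return kept
```

For chunks of 3000, 1500, and 900 tokens scored 0.9, 0.6, and 0.8, the 0.6 chunk is dropped and the remaining 3900 tokens fit the default budget.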

6. LLM generation

GPT-4o-mini (default) or a tenant-configured LLM generates the response, grounded in the retrieved context. The default temperature is 0.3 for grounded customer-facing answers.

Technical: Per-tenant LLM provider abstraction allows OpenAI / Anthropic / Cohere swap without customer-side migration.

7. Citation extraction

The response includes a structured citation list: { sourceUrl, snippet, score } per retrieved chunk. The snippet is the actual quoted text, not just the URL.

Technical: The BuildSnippet helper in the QueryRequest record extracts the relevant 200-character span from the chunk.
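
How the "relevant" span is chosen is not specified above. The sketch below is one plausible approach, not the actual BuildSnippet logic: it centers a 200-character window on the first query term found in the chunk and falls back to the start of the chunk. The function names and the centering heuristic are assumptions.

```python
from typing import Dict, Union


def build_snippet(chunk_text: str, query: str, width: int = 200) -> str:
    """Return up to `width` characters of chunk_text around a query term."""
    lower = chunk_text.lower()
    hit = -1
    for term in query.lower().split():
        hit = lower.find(term)
        if hit != -1:
            break
    if hit == -1:
        hit = 0  # no term matched: fall back to the start of the chunk
    # Center the window on the hit, clamped to the chunk's bounds.
    start = max(0, min(hit - width // 2, len(chunk_text) - width))
    return chunk_text[start:start + width]


def build_citation(
    source_url: str, chunk_text: str, query: str, score: float
) -> Dict[str, Union[str, float]]:
    """Assemble one { sourceUrl, snippet, score } citation entry."""
    return {
        "sourceUrl": source_url,
        "snippet": build_snippet(chunk_text, query),
        "score": score,
    }
```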

8. SSE streaming with sources-early event

Server-Sent Events transport. First event is sources-early — emits citation metadata before LLM streaming starts. Widget renders "according to" hover-card while the answer is still streaming.

Technical: Cuts perceived latency by ~70% vs synchronous response. Enables AI scrapers to extract grounded quotes from the response.
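
The event ordering can be sketched with the standard SSE wire format (an "event:" line, a "data:" line, then a blank line). Only the sources-early event name comes from the text above; the "token" event name and the payload shapes are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List


def sse_event(event: str, data: object) -> str:
    """Format one Server-Sent Events frame with a JSON payload."""
    # json.dumps escapes newlines, so the payload stays on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(
    citations: List[Dict[str, object]], tokens: Iterable[str]
) -> Iterator[str]:
    """Yield citation metadata first, then the answer tokens."""
    yield sse_event("sources-early", citations)
    for token in tokens:
        # "token" is a hypothetical event name for streamed answer text.
        yield sse_event("token", {"text": token})
```

Because the generator yields the citation frame before it touches the token iterable, a client can render sources while the LLM is still producing the first token.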

9. LLM-as-Judge confidence scoring

Every response is scored by a secondary LLM call on three axes: factuality, hallucination, and confidence. Scores surface in the admin Inbox.

Technical: A confidence score below 0.5 typically triggers a human-handoff fallback rather than a guessed answer.

10. Human-handoff fallback

When confidence is low or the query is identified as high-risk (clinical advice, legal position, regulatory question), the bot routes to a "human will follow up" pattern. The visitor receives an acknowledgement and a follow-up channel.

Technical: Per-vertical risk classifier tuned for each industry. Med routes ALL diagnosis-adjacent queries to a human; Legal routes ALL substantive-legal-question queries to a human.
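
The routing decision reduces to a simple rule, sketched here. The risk label strings and the function name are hypothetical, the classifier that produces the labels is out of scope, and the 0.5 cut-off is the typical value, not a guaranteed one.

```python
from typing import AbstractSet, FrozenSet

# Hypothetical labels emitted by a per-vertical risk classifier.
HIGH_RISK_LABELS: FrozenSet[str] = frozenset(
    {"clinical_advice", "legal_position", "regulatory_question"}
)


def should_hand_off(
    confidence: float, risk_labels: AbstractSet[str], min_confidence: float = 0.5
) -> bool:
    """Route to a human when confidence is low or the query is high-risk."""
    if confidence < min_confidence:
        return True
    return not HIGH_RISK_LABELS.isdisjoint(risk_labels)
```

A high-risk label wins even when confidence is high, which matches the rule that such queries always go to a human.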

11. Per-response audit trail

Every response is logged with full context: input query, retrieved chunks with scores, system prompt, LLM model used, generated response, citation snippets, confidence scores.

Technical: Audit logs retained 13 months. Per-tenant audit log exportable on Enterprise tier.
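
As an illustration of what one audit entry might carry, the sketch below serializes the fields listed above to a single JSON line. The record layout, field names, and the added timestamp are assumptions, not the actual log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class AuditRecord:
    """Hypothetical shape of one per-response audit entry."""

    client_id: str
    query: str
    retrieved: List[Dict[str, object]]  # chunk reference plus score
    system_prompt: str
    model: str
    response: str
    citations: List[Dict[str, object]]
    scores: Dict[str, float]  # factuality, hallucination, confidence
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```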

12. Continuous eval feedback

The eval harness runs nightly against a sealed 200-question test set per vertical. Hallucination scores are tracked over time. Regressions of ≥3 points trigger a manual triage alert.

Technical: Eval methodology open-source — buyers can run it against their own SLAtech tenant. Published scoreboard at /en/eval/.
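
The regression check can be sketched as a comparison between two runs. Two things are assumed here because the text does not state them: that the comparison is against the previous run's score, and the direction of the metric, which is made an explicit parameter. The function name is hypothetical.

```python
def is_regression(
    previous: float,
    current: float,
    threshold: float = 3.0,
    higher_is_better: bool = True,
) -> bool:
    """Return True when the score worsened by at least `threshold` points."""
    # Assumption: scores are in points on a shared scale (for example 0-100).
    worsened_by = previous - current if higher_is_better else current - previous
    return worsened_by >= threshold
```

For a metric where lower is better, such as a hallucination rate, pass higher_is_better=False so that an increase counts as the regression.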

Verify the grounding on your own tenant

The eval harness and methodology are open-source. Run the harness against your SLAtech tenant to verify per-response factuality.

Eval scoreboard → Architecture overview