1. Tenant content ingestion
Documents (PDF, DOCX, scraped HTML, FAQ pairs, manually authored articles) are chunked into 200-500 token segments with a 50-token overlap. Each chunk gets metadata: ClientId (tenant partition), sourceUrl, chunkIndex, lastUpdated.
Technical: The chunking algorithm respects document structure (paragraph boundaries) when possible. The overlap prevents context loss across chunk boundaries.
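A minimal sketch of paragraph-aware chunking with overlap, under stated assumptions: whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), the function name chunk_text is hypothetical, and the 200-token lower bound is not enforced here.

```python
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens words.

    Each new chunk starts with the last `overlap` words of the previous one.
    Words are an assumed stand-in for real tokenizer tokens.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    chunks: list[str] = []
    current: list[str] = []  # words of the chunk being built
    fresh = 0                # words in `current` that are not carried-over overlap

    def flush() -> None:
        nonlocal current, fresh
        if fresh:  # skip chunks that would contain only repeated overlap
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            fresh = 0

    for paragraph in text.split("\n\n"):
        tokens = paragraph.split()
        if len(current) + len(tokens) > max_tokens:
            flush()  # prefer to break at the paragraph boundary
        while tokens:
            room = max_tokens - len(current)
            current = current + tokens[:room]
            fresh += min(room, len(tokens))
            tokens = tokens[room:]
            if tokens:  # paragraph longer than one chunk: hard split
                flush()
    flush()
    return chunks
```

With max_tokens=6 and overlap=2, three four-word paragraphs produce three chunks, each beginning with the last two words of the one before it.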
2. Embedding generation
OpenAI text-embedding-3-small (1536 dimensions) converts each chunk into a semantic vector. Embeddings are stored in Qdrant with the chunk's metadata.
Technical: The same embedding model is used for queries, which keeps cosine-similarity comparisons semantically valid.
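For reference, cosine similarity itself can be sketched in a few lines. The comparison only carries meaning when both vectors come from the same model: vectors from different models live in unrelated spaces and may not even share a dimension, which the length check below makes explicit. The function name is illustrative, and Qdrant computes this internally in production.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension (same embedding model)")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)
```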
3. Query embedding
When a visitor submits a message, the bot embeds the query using the same OpenAI model. The vector search is filtered by ClientId before retrieval, so cross-tenant contamination is structurally impossible.
Technical: The repository pattern enforces the ClientId filter at compile time via a static analyzer rule (SLATECH001).
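The idea of a repository that cannot be queried without a tenant can be illustrated with a small in-memory sketch. This is not the SLATECH001 analyzer: Python has no compile-time equivalent, so the check here happens at runtime, and the names Chunk and TenantScopedRepository are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    client_id: str
    source_url: str
    text: str


class TenantScopedRepository:
    """In-memory stand-in for a vector store whose only read path requires a tenant."""

    def __init__(self, chunks: list[Chunk]) -> None:
        self._chunks = list(chunks)

    def candidates(self, *, client_id: str) -> list[Chunk]:
        # client_id is keyword-only and mandatory: there is no unfiltered read path.
        if not client_id:
            raise ValueError("client_id is required for every query")
        return [c for c in self._chunks if c.client_id == client_id]
```

Because candidates is the only accessor, similarity ranking can only ever run over one tenant's chunks.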
4. Top-K retrieval
Qdrant returns the top-K (default 10) chunks with the highest cosine similarity to the query. The default ScoreThreshold = 0.5 filters out low-relevance chunks.
Technical: TopK is clamped to [1, 20] in the per-tenant configuration. Queries whose results all fall below the threshold route to a "no relevant content" fallback rather than hallucinating.
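The clamping and threshold logic can be sketched as follows. The function names are illustrative, and treating a score exactly equal to the threshold as passing is an assumption; an empty result is what would route to the "no relevant content" fallback.

```python
def clamp_top_k(requested: int, low: int = 1, high: int = 20) -> int:
    """Keep a tenant-configured TopK inside [low, high]."""
    return max(low, min(high, requested))


def select_hits(
    scored: list[tuple[str, float]],
    top_k: int = 10,
    score_threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Return up to top_k (chunk_id, score) pairs at or above the threshold, best first."""
    k = clamp_top_k(top_k)
    ranked = sorted(scored, key=lambda hit: hit[1], reverse=True)[:k]
    # Assumption: the threshold is inclusive.
    return [hit for hit in ranked if hit[1] >= score_threshold]
```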
5. Context assembly
Retrieved chunks + system prompt + conversation history are passed to the LLM. The system prompt explicitly instructs the LLM: "Answer only from the provided context. If the answer is not in the context, say so."
Technical: A token budget is enforced (default 4000 tokens of context); if the budget is exceeded, the lowest-scoring chunks are dropped first.
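A sketch of the budget rule, assuming the budget applies to the retrieved chunks only and that each chunk's token count is already known. The tuple layout and the name fit_to_budget are illustrative.

```python
def fit_to_budget(
    chunks: list[tuple[str, float, int]],
    budget: int = 4000,
) -> list[tuple[str, float, int]]:
    """Drop the lowest-scoring chunks until the total token count fits the budget.

    Each chunk is (text, score, token_count); the result is ordered best first.
    """
    kept = sorted(chunks, key=lambda c: c[1], reverse=True)
    total = sum(c[2] for c in kept)
    while kept and total > budget:
        total -= kept.pop()[2]  # the last element has the lowest score
    return kept
```

For example, three chunks of 3000, 900, and 800 tokens exceed a 4000-token budget, so the lowest-scoring of the three is dropped and the remaining 3900 tokens are kept.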
6. LLM generation
GPT-4o-mini (default) or a tenant-configured LLM generates the response, grounded in the retrieved context. The default temperature is 0.3 for grounded customer-facing answers.
Technical: The per-tenant LLM provider abstraction allows swapping between OpenAI, Anthropic, and Cohere without customer-side migration.
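One way such a provider abstraction could look is a shared interface plus a per-tenant lookup. Everything named here (LlmProvider, EchoProvider, the config keys) is hypothetical, and the stand-in provider only echoes its input instead of calling a vendor API.

```python
from typing import Protocol


class LlmProvider(Protocol):
    def generate(self, prompt: str, temperature: float) -> str: ...


class EchoProvider:
    """Stand-in provider for this sketch; a real one would call a vendor SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str, temperature: float) -> str:
        return f"[{self.name} t={temperature}] {prompt}"


DEFAULT_PROVIDER = "openai"
DEFAULT_TEMPERATURE = 0.3


def answer(prompt: str, tenant_config: dict, providers: dict[str, LlmProvider]) -> str:
    """Pick the tenant's provider and temperature, falling back to the defaults."""
    provider = providers[tenant_config.get("provider", DEFAULT_PROVIDER)]
    temperature = tenant_config.get("temperature", DEFAULT_TEMPERATURE)
    return provider.generate(prompt, temperature)
```

Callers depend only on generate, so changing a tenant's provider is a configuration change rather than a code change.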
7. Citation extraction
The response includes a structured citation list: { sourceUrl, snippet, score } per retrieved chunk. The snippet is the actual quoted text, not just the URL.
Technical: The BuildSnippet helper in the QueryRequest record extracts the relevant 200-character span from the chunk.
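How the relevant span is chosen is not specified above. As one plausible approach, and not the actual BuildSnippet logic, the sketch below centers a 200-character window on the earliest occurrence of any query term and falls back to the start of the chunk when no term matches.

```python
def build_snippet(chunk_text: str, query: str, width: int = 200) -> str:
    """Return a window of at most `width` characters around the first query-term match."""
    if len(chunk_text) <= width:
        return chunk_text
    lowered = chunk_text.lower()
    positions = [lowered.find(term) for term in query.lower().split()]
    hits = [p for p in positions if p >= 0]
    anchor = min(hits) if hits else 0  # assumption: no match means quote from the start
    # Center the window on the match, keeping it inside the chunk.
    start = max(0, min(anchor - width // 2, len(chunk_text) - width))
    return chunk_text[start:start + width]
```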
8. SSE streaming with sources-early event
The transport is Server-Sent Events. The first event is sources-early, which emits citation metadata before LLM streaming starts. The widget renders an "according to" hover-card while the answer is still streaming.
Technical: Cuts perceived latency by ~70% vs a synchronous response. Enables AI scrapers to extract grounded quotes from the response.
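The event ordering can be sketched with the standard SSE wire format (an event: line, a data: line, then a blank line). Only the sources-early event name comes from the description above; the token and done event names and the payload shapes are assumptions.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Event; json.dumps keeps the data on a single line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_response(citations: list[dict], tokens: Iterable[str]) -> Iterator[str]:
    """Emit citations first, then the answer tokens as they arrive."""
    yield sse_event("sources-early", {"citations": citations})
    for token in tokens:
        yield sse_event("token", {"text": token})  # hypothetical event name
    yield sse_event("done", {})                    # hypothetical event name
```

Because the generator yields sources-early before it touches the token iterator, a client can render citations while the LLM is still producing its first token.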
9. LLM-as-Judge confidence scoring
Every response is scored by a secondary LLM call on three axes: factuality, hallucination, and confidence. Scores surface in the admin Inbox.
Technical: A confidence score below 0.5 typically triggers a human-handoff fallback rather than a guessed answer.
10. Human-handoff fallback
When confidence is low OR the query is identified as high-risk (clinical advice, legal position, regulatory question), the bot routes to a "human will follow up" pattern. The visitor receives an acknowledgement + a follow-up channel.
Technical: The per-vertical risk classifier is tuned for each industry. Med routes ALL diagnosis-adjacent queries to a human; Legal routes ALL substantive-legal-question queries to a human.
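The routing decision reduces to a simple OR over the two conditions. In this sketch the risk classifier is assumed to have already run and produced a category label; the label strings and the function name are hypothetical, and the fixed 0.5 cut-off is a simplification of a rule described above as applying "typically".

```python
from typing import Optional

# Hypothetical labels a per-vertical risk classifier might emit.
HIGH_RISK_CATEGORIES = {"clinical_advice", "legal_position", "regulatory_question"}


def should_hand_off(
    confidence: float,
    risk_category: Optional[str] = None,
    threshold: float = 0.5,
) -> bool:
    """Route to a human when the judge's confidence is low OR the query is high-risk."""
    return confidence < threshold or risk_category in HIGH_RISK_CATEGORIES
```

A high-confidence answer to a high-risk query is still handed off, which matches the "routes ALL" behavior described for the Med and Legal verticals.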
11. Per-response audit trail
Every response is logged with full context: input query, retrieved chunks with scores, system prompt, LLM model used, generated response, citation snippets, confidence scores.
Technical: Audit logs are retained for 13 months. The per-tenant audit log is exportable on the Enterprise tier.
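As an illustration of what one audit entry might contain, the sketch below maps the listed fields onto a record serialized as a JSON line. The field names and the JSON-lines format are assumptions, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AuditRecord:
    client_id: str
    query: str
    retrieved: list[dict]         # e.g. {"sourceUrl": ..., "score": ...} per chunk
    system_prompt: str
    model: str
    response: str
    citation_snippets: list[str]
    judge_scores: dict[str, float]


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```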
12. Continuous eval feedback
The eval harness runs nightly against a sealed 200-question test set per vertical. Hallucination scores are tracked over time. Regressions of ≥3 points trigger a manual triage alert.
Technical: The eval methodology is open-source, so buyers can run it against their own SLAtech tenant. The scoreboard is published at /en/eval/.
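The regression check can be sketched as a comparison of two nightly runs per vertical. The direction of the score is not stated above, so it is an explicit parameter here (defaulting to "higher hallucination score is worse"); the function name and the dict layout are also assumptions.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = 3.0,
    higher_is_worse: bool = True,
) -> dict[str, float]:
    """Return verticals whose score worsened by at least `threshold` points."""
    flagged: dict[str, float] = {}
    for vertical, new_score in current.items():
        if vertical not in previous:
            continue  # no baseline to compare against
        delta = new_score - previous[vertical]
        worsening = delta if higher_is_worse else -delta
        if worsening >= threshold:
            flagged[vertical] = worsening
    return flagged
```

Any vertical in the returned dict would be a candidate for the manual triage alert.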