1. Tenant content ingestion
Documents (PDF, DOCX, scraped HTML, FAQ pairs, manually authored articles) are chunked into 200-500 token segments with a 50-token overlap. Each chunk gets metadata: ClientId (tenant partition), sourceUrl, chunkIndex, lastUpdated.
Technical: The chunking algorithm respects document structure (paragraph boundaries) where possible. The overlap prevents context loss across chunk boundaries.
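A minimal sketch of overlap chunking, assuming the document has already been split into a list of tokens (the tokenizer is not specified here). The names Chunk and chunk_tokens are hypothetical, and paragraph-boundary handling and the lastUpdated field are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    client_id: str   # tenant partition (ClientId in the text)
    source_url: str
    chunk_index: int
    text: str


def chunk_tokens(tokens, client_id, source_url, size=500, overlap=50):
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens, so text near a boundary
    appears in both neighbouring chunks.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append(Chunk(client_id, source_url, len(chunks), " ".join(window)))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks
```

With the defaults, each window starts 450 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 of the next.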
2. Embedding generation
OpenAI text-embedding-3-small (1536 dimensions) converts each chunk into a semantic vector. Embeddings are stored in Qdrant together with the chunk's metadata.
Technical: The same embedding model is used for queries, which keeps cosine-similarity comparisons semantically valid.
3. Query embedding
When a visitor submits a message, the bot embeds the query using the same OpenAI model. The search is filtered by ClientId before retrieval, so cross-tenant contamination is structurally impossible.
Technical: The repository pattern enforces the ClientId filter at compile time through a static analyzer rule (SLATECH001).
4. Top-K retrieval
Qdrant returns the top-K (default 10) chunks with the highest cosine similarity to the query. The default ScoreThreshold = 0.5 filters out low-relevance chunks.
Technical: TopK is clamped to [1, 20] in the per-tenant configuration. Queries whose results all fall below the threshold route to a "no relevant content" fallback instead of a hallucinated answer.
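The tenant filter, clamp, and threshold described in steps 3 and 4 can be illustrated with an in-memory stand-in for the vector store. This is a sketch only: it does not call Qdrant, and the function names and the shape of `points` are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(points, query_vec, client_id, top_k=10, score_threshold=0.5):
    """Return up to top_k (score, payload) pairs for one tenant.

    `points` is a list of (vector, payload) tuples; payload["ClientId"]
    identifies the tenant. An empty result means "no relevant content".
    """
    top_k = max(1, min(20, top_k))  # clamp to [1, 20]
    scored = [
        (cosine(vec, query_vec), payload)
        for vec, payload in points
        if payload["ClientId"] == client_id  # tenant filter applied before scoring
    ]
    hits = [(s, p) for s, p in scored if s >= score_threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:top_k]
```

An empty list is the signal a caller would use to switch to the "no relevant content" fallback rather than prompting the LLM with nothing.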
5. Context assembly
Retrieved chunks, the system prompt, and conversation history are passed to the LLM. The system prompt explicitly instructs the LLM: "Answer only from provided context. If answer not in context, say so."
Technical: A token budget is enforced (default 4000 tokens of context); if the budget is exceeded, the lowest-scoring chunks are dropped first.
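One way to read the budget rule is sketched below. The dict shape with "score" and "tokens" keys and the function name fit_to_budget are assumptions for illustration; token counts are taken as precomputed.

```python
def fit_to_budget(chunks, budget=4000):
    """Keep the highest-scoring chunks whose token counts fit the budget.

    `chunks` is a list of dicts with "score" and "tokens" keys. Chunks are
    dropped from the low-score end until the remainder fits.
    """
    kept = sorted(chunks, key=lambda c: c["score"], reverse=True)
    total = sum(c["tokens"] for c in kept)
    while kept and total > budget:
        total -= kept.pop()["tokens"]  # pop() removes the lowest-scoring chunk
    return kept
```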
6. LLM generation
GPT-4o-mini (default) or a tenant-configured LLM generates the response, grounded in the retrieved context. The default temperature is 0.3 for grounded, customer-facing answers.
Technical: A per-tenant LLM provider abstraction allows swapping between OpenAI, Anthropic, and Cohere without customer-side migration.
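A provider abstraction of this kind might look roughly like the following. Everything here is hypothetical: the LlmProvider protocol, the EchoProvider stand-in (which calls no vendor API), and the tenant_config keys are invented to show the shape of the idea.

```python
from typing import Protocol

DEFAULT_TEMPERATURE = 0.3


class LlmProvider(Protocol):
    def generate(self, system_prompt: str, context: str, question: str,
                 temperature: float) -> str: ...


class EchoProvider:
    """Stand-in provider used only to make the sketch runnable."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, system_prompt, context, question, temperature):
        return f"[{self.name} t={temperature}] {question}"


def answer(providers: dict, tenant_config: dict, question: str, context: str) -> str:
    # The tenant config names a provider; calling code never touches a vendor SDK.
    provider: LlmProvider = providers[tenant_config.get("provider", "openai")]
    return provider.generate(
        "Answer only from provided context.",
        context,
        question,
        tenant_config.get("temperature", DEFAULT_TEMPERATURE),
    )
```

Switching a tenant to another vendor then amounts to changing one configuration value, which is the property the step describes.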
7. Citation extraction
The response includes a structured citation list: { sourceUrl, snippet, score } per retrieved chunk. The snippet is the actual quoted text, not just a URL.
Technical: The BuildSnippet helper in the QueryRequest record extracts the relevant 200-character span from the chunk.
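How BuildSnippet picks the "relevant" span is not stated, so the sketch below makes an assumption: the window is positioned around the first case-insensitive match of any query word, falling back to the start of the chunk. The Python name build_snippet is a hypothetical analogue, not the actual helper.

```python
def build_snippet(chunk_text, query, length=200):
    """Return a span of at most `length` characters from chunk_text.

    Assumption: the span is positioned around the earliest occurrence of any
    query word; with no match, the chunk's opening characters are used.
    """
    lowered = chunk_text.lower()
    positions = [lowered.find(word) for word in query.lower().split()]
    positions = [p for p in positions if p >= 0]
    centre = min(positions) if positions else 0
    # Keep the window inside the text: never before 0, never past the end.
    start = max(0, min(centre - length // 2, len(chunk_text) - length))
    return chunk_text[start:start + length]
```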
8. SSE streaming with a sources-early event
Server-Sent Events transport. The first event, sources-early, emits citation metadata before LLM streaming starts. The widget renders an "according to" hover card while the answer is still streaming.
Technical: This cuts perceived latency by ~70% versus a synchronous response. It also lets AI scrapers extract grounded quotes from the response.
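The event ordering can be sketched as a generator of SSE frames. Only the sources-early event name comes from the text; the "token" and "done" event names, the payload shapes, and both function names are assumptions for the example.

```python
import json


def sse_event(event, data):
    """Format one Server-Sent Events frame: event name, JSON data, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_response(citations, token_iter):
    """Yield SSE frames: citations first, then answer tokens, then a done marker."""
    yield sse_event("sources-early", {"citations": citations})
    for token in token_iter:
        yield sse_event("token", {"text": token})
    yield sse_event("done", {})
```

Because the citations frame is yielded before the token iterator is consumed, a client can render sources without waiting for the first generated token.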
9. LLM-as-Judge confidence scoring
Each response is scored by a secondary LLM call on three axes: factuality, hallucination, and confidence. Scores surface in the admin Inbox.
Technical: A confidence score below 0.5 typically triggers the human-handoff fallback instead of a guessed answer.
10. Human-handoff fallback
When confidence is low OR the query is identified as high-risk (clinical advice, legal position, regulatory question), the bot routes to a "human will follow up" pattern. The visitor receives an acknowledgement plus a follow-up channel.
Technical: A per-vertical risk classifier is tuned for each industry. Med routes ALL diagnosis-adjacent queries to a human; Legal routes ALL substantive legal questions to a human.
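The routing rule from steps 9 and 10 reduces to a small decision function. In this sketch the label strings, the should_hand_off name, and the idea that the risk classifier returns a set of labels are all assumptions; the 0.5 floor is the threshold quoted in step 9.

```python
HIGH_RISK = {"clinical_advice", "legal_position", "regulatory_question"}
CONFIDENCE_FLOOR = 0.5


def should_hand_off(judge_scores, risk_labels):
    """Decide whether a response goes to a human instead of the visitor.

    judge_scores: dict with a "confidence" value in [0, 1] from the judge call.
    risk_labels: set of labels produced by a (hypothetical) risk classifier.
    """
    if risk_labels & HIGH_RISK:
        return True  # high-risk topics go to a human regardless of confidence
    return judge_scores["confidence"] < CONFIDENCE_FLOOR
```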
11. Per-response audit trail
Each response is logged with full context: input query, retrieved chunks with scores, system prompt, LLM model used, generated response, citation snippets, confidence scores.
Technical: Audit logs are retained for 13 months. The per-tenant audit log is exportable on the Enterprise tier.
12. Continuous eval feedback
The eval harness runs nightly against a sealed 200-question test set per vertical. Hallucination scores are tracked over time. Regressions of ≥3 points trigger a manual triage alert.
Technical: The eval methodology is open source, so buyers can run it against their own SLAtech tenant. The scoreboard is published at /ru/eval/.
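The regression check could be as simple as comparing two nightly runs. This sketch assumes scores are on a 0-100 scale where a higher hallucination score is worse; the source does not state the scale or direction, and find_regressions is a hypothetical name.

```python
def find_regressions(previous, current, threshold=3.0):
    """Return verticals whose hallucination score worsened by >= threshold points.

    Assumption: higher score means more hallucination, so a regression is an
    increase from the previous run to the current one.
    """
    return sorted(
        vertical
        for vertical, score in current.items()
        if vertical in previous and score - previous[vertical] >= threshold
    )
```

Any vertical in the returned list would be a candidate for the manual triage alert.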