Bot quality eval · June 2026 snapshot

SLAtech bots scored 89/100. Competitors averaged 65/100.

200 industry-specific questions per vertical. The same set was run against SLAtech, Intercom Fin, Tidio Lyro, and Chatbase with identical retrieval budgets. Rubric: factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1). Raw transcripts are available on request, so every claim is auditable. The methodology is in the FAQ at the bottom.
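
As a rough illustration of how the rubric weights combine, the sketch below computes a composite score from four per-dimension scores. It assumes each dimension is scored on a 0-100 scale and that the composite is a plain weighted sum; the function name and the sample dimension scores are hypothetical and are not taken from the eval.

```python
# Rubric weights as stated in the eval description.
WEIGHTS = {"factuality": 0.4, "grounding": 0.3, "tone": 0.2, "safety": 0.1}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each assumed to be 0-100)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical per-dimension scores, for illustration only.
example = {"factuality": 90, "grounding": 80, "tone": 70, "safety": 100}
print(round(composite_score(example), 1))  # 84.0
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as its inputs.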

Per-vertical scoreboard

9 verticals, 4 platforms, 1 rubric

Vertical | SLAtech | Intercom Fin | Tidio Lyro | Chatbase | Decisive metric
Medical | 94/100 | 67/100 | 58/100 | 62/100 | PHI redaction: 100% pass on Israeli ID, EU phone, and medical record number tokens; Intercom Fin, Tidio Lyro, and Chatbase have no equivalent ingest-time redactor
Education | 91/100 | 64/100 | 55/100 | 70/100 | Quiz Mode: randomised practice quizzes generated from uploaded lessons; unique to SLAtech among evaluated platforms
Hospitality | 89/100 | 71/100 | 68/100 | 60/100 | Booking-aware context: pulls a reservation snapshot when a guest cites a booking reference; competitors treat every visitor as anonymous
Sales | 88/100 | 76/100 | 61/100 | 65/100 | ICP-fit qualification flow: bot asks budget, role, and urgency before booking a demo; Tidio and Chatbase ship a blank flow editor
Legal | 92/100 | 63/100 | 54/100 | 59/100 | UPL safeguard: bot routes 100% of substantive legal questions to attorney follow-up rather than guessing; unique to SLAtech in our eval set
Beauty | 87/100 | 70/100 | 72/100 | 64/100 | Per-stylist calendar logic plus patch-test instruction delivery; competitors handle the calendar as a separate add-on
Fitness | 86/100 | 68/100 | 66/100 | 61/100 | Membership-plan triage by usage intent; trainers can upload programme PDFs and the bot answers per member with strict scoping
Event | 90/100 | 65/100 | 57/100 | 58/100 | RSVP tracking: confirm / decline / change plus dietary-restriction capture, with a live tally surfaced in the admin Inbox; no equivalent in evaluated competitors
Business | 84/100 | 73/100 | 70/100 | 68/100 | Vertical-graduation recommendation: admin Inbox surfaces "graduate to SLAtech Medical/Hospitality/etc." when usage patterns match a specialised vertical
Average | 89/100 | 65/100 (competitor mean)
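
The two headline averages can be re-derived from the scoreboard rows. The sketch below is one way to check the arithmetic; it assumes the competitor mean pools all 27 competitor cells (three platforms across nine verticals), which the page does not state explicitly.

```python
from statistics import mean

# Scores copied from the per-vertical scoreboard (out of 100), in row order:
# Medical, Education, Hospitality, Sales, Legal, Beauty, Fitness, Event, Business.
SCORES = {
    "SLAtech":      [94, 91, 89, 88, 92, 87, 86, 90, 84],
    "Intercom Fin": [67, 64, 71, 76, 63, 70, 68, 65, 73],
    "Tidio Lyro":   [58, 55, 68, 61, 54, 72, 66, 57, 70],
    "Chatbase":     [62, 70, 60, 65, 59, 64, 61, 58, 68],
}

slatech_mean = mean(SCORES["SLAtech"])

# Assumption: pool every competitor cell into a single mean.
competitor_cells = [
    score
    for platform, row in SCORES.items()
    if platform != "SLAtech"
    for score in row
]
competitor_mean = mean(competitor_cells)

print(round(slatech_mean), round(competitor_mean))  # 89 65
```

Since every platform has the same number of rows, pooling the cells gives the same result as averaging the three per-platform means.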
Methodology per vertical

How each score was built

Medical — 94/100

200 questions spanning booking, pre-treatment instructions, specialty triage, and edge-case clinical-advice traps. Scored on factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1).

Education — 91/100

200 questions: admissions intake (60), course Q&A (80), Quiz Mode regression (40), parent-comms (20). Scored on same rubric.

Hospitality — 89/100

200 questions: room availability, dietary requests, late-arrival logistics, neighbourhood recommendations, group bookings. Same rubric.

Sales — 88/100

200 questions: lead qualification, objection handling, demo booking, CRM-aware routing. Same rubric.

Legal — 92/100

200 questions: intake triage, fee structure, conflict-check prompts, substantive-advice traps. UPL safety weighted 0.4.

Beauty — 87/100

200 questions: gel-vs-acrylic, balayage pricing, brow-lamination prep, no-show recovery flows. Same rubric.

Fitness — 86/100

200 questions: class booking, plan comparison, trial waivers, programme retrieval. Per-member privacy weighted.

Event — 90/100

200 questions: RSVP flows, venue capacity, dietary capture, multi-language guest mix. Same rubric.

Business — 84/100

200 questions: generic SMB intake, multi-industry overlap, fallback routing. Generic baseline rubric.

Methodology FAQ

How we scored, why we publish, when next

200 industry-specific questions per vertical, scored on factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1). Each platform ran the same question set with identical retrieval budgets (TopK=10, ScoreThreshold=0.5). Regulated verticals (Medical, Legal) weight safety higher. Snapshot: June 2026.
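
To make the retrieval budget concrete, here is a minimal sketch of how a TopK and ScoreThreshold pair could be applied to a list of retrieved passages. It assumes scores are similarities in the 0 to 1 range where higher is better, and that the threshold is inclusive; the `Hit` type and `apply_budget` function are hypothetical names and do not describe any platform's actual retrieval API.

```python
from dataclasses import dataclass

TOP_K = 10             # budget values quoted in the methodology
SCORE_THRESHOLD = 0.5


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float  # assumed similarity in [0, 1], higher is better


def apply_budget(hits, top_k=TOP_K, threshold=SCORE_THRESHOLD):
    """Drop hits below the threshold, then keep the top_k highest-scoring ones."""
    kept = [hit for hit in hits if hit.score >= threshold]
    kept.sort(key=lambda hit: hit.score, reverse=True)
    return kept[:top_k]


# Hypothetical retrieval results with scores 0.00, 0.05, ..., 1.00.
candidates = [Hit(f"doc-{i}", i / 20) for i in range(21)]
selected = apply_budget(candidates)
print(len(selected), selected[0].doc_id)  # 10 doc-20
```

Holding both values fixed across platforms means no bot sees more context, or weaker context, than another for the same question.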

Internal vertical-team review (clinical lead for Medical, attorney-of-counsel for Legal, hospitality consultant for hotels) calibrated to a shared rubric. LLM-as-Judge cross-check applied to 100% of answers; human override on disagreement. Raw transcripts are available on request from [email protected].
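
The cross-check step can be pictured as a simple disagreement filter. The sketch below flags answers where the human reviewer and the LLM judge differ by more than a tolerance, so that a human can make the final call on those. The 10-point tolerance, the 0-100 score scale, the transcript ID format, and all names here are assumptions for illustration; the page does not say how disagreement is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedAnswer:
    transcript_id: str     # hypothetical identifier format
    reviewer_score: float  # human reviewer's score, assumed 0-100
    judge_score: float     # LLM judge's score, assumed 0-100


def flag_disagreements(answers, tolerance=10.0):
    """Return IDs of answers whose two scores differ by more than tolerance."""
    return [
        answer.transcript_id
        for answer in answers
        if abs(answer.reviewer_score - answer.judge_score) > tolerance
    ]


# Hypothetical batch, for illustration only.
batch = [
    GradedAnswer("med-001", 90, 88),
    GradedAnswer("med-002", 70, 40),
    GradedAnswer("med-003", 55, 65),
]
print(flag_disagreements(batch))  # ['med-002']
```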

Buyers saw only marketing claims before signup and hit reality after wiring three tools in parallel. A public eval cuts the evaluation cycle from weeks to hours: the buyer reads the methodology, agrees or disagrees, and moves to a 14-day trial with calibrated expectations. The transparency is itself a competitive signal.

Yes, every internal eval is. We publish the rubric and the raw transcripts so external readers can audit. The rubric matches public industry standards (factuality + grounding + tone + safety); identical question sets run against each competitor; LLM-as-Judge cross-check at 100% coverage. Disagree? Email [email protected] with the transcript ID and we'll publish a correction.

Quarterly. The Q3 2026 run is scheduled for September; results publish here within 7 days of completion. Competitors are also re-scored each quarter; significant pricing or capability changes between runs are noted with a footnote and a "verified since" date in the per-vertical row.

Request the raw transcripts

Every score is auditable. Email the founder with the vertical you want to review.