Bot quality eval · June 2026 snapshot

SLAtech bots scored 89/100. Competitors averaged 65/100.

200 industry-specific questions per vertical. The same set was run against SLAtech, Intercom Fin, Tidio Lyro, and Chatbase with identical retrieval budgets. Rubric: factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1). Raw transcripts are available on request, so every claim is auditable. The methodology is in the FAQ at the bottom.
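A minimal sketch of how the four rubric weights might combine into a single 0-100 score. The weights come from the rubric above; the function name, the 0-100 scale for each criterion, and the sample sub-scores are illustrative assumptions, not the eval's actual code or data.

```python
# Rubric weights as stated in the methodology; they sum to 1.0.
WEIGHTS = {"factuality": 0.4, "grounding": 0.3, "tone": 0.2, "safety": 0.1}


def rubric_score(sub_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-100 each) into one weighted score."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores for a single answer, not taken from the eval.
example = {"factuality": 90, "grounding": 80, "tone": 100, "safety": 100}
print(round(rubric_score(example), 1))  # 36 + 24 + 20 + 10 = 90.0
```

Because the weights sum to 1.0, the combined score stays on the same 0-100 scale as the individual criteria.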

Per-vertical scoreboard

9 verticals, 4 platforms, 1 rubric

Vertical	SLAtech	Intercom Fin	Tidio Lyro	Chatbase	Decisive metric
Medical	94/100	67/100	58/100	62/100	PHI redaction: 100% pass on Israeli ID + EU phone + medical record number tokens; Intercom Fin / Tidio Lyro / Chatbase have no equivalent ingest-time redactor
Education	91/100	64/100	55/100	70/100	Quiz Mode: randomised practice quizzes generated from uploaded lessons; unique to SLAtech among evaluated platforms
Hospitality	89/100	71/100	68/100	60/100	Booking-aware context: pulls the reservation snapshot when a guest cites a booking reference; competitors treat every visitor as anonymous
Sales	88/100	76/100	61/100	65/100	ICP-fit qualification flow: the bot asks budget + role + urgency before booking a demo; Tidio / Chatbase ship a blank flow editor
Legal	92/100	63/100	54/100	59/100	UPL safeguard: the bot routes 100% of substantive legal questions to attorney follow-up rather than guessing; unique to SLAtech in our eval set
Beauty	87/100	70/100	72/100	64/100	Per-stylist calendar logic + patch-test instruction delivery; competitors handle the calendar as a separate add-on
Fitness	86/100	68/100	66/100	61/100	Membership-plan triage by usage intent; trainers can upload programme PDFs and the bot answers per-member with strict scoping
Event	90/100	65/100	57/100	58/100	RSVP tracking: confirm / decline / change + dietary-restriction capture, surfaces a live tally in the admin Inbox; no equivalent in the evaluated competitors
Business	84/100	73/100	70/100	68/100	Vertical-graduation recommendation: the admin Inbox surfaces "graduate to SLAtech Medical/Hospitality/etc." when usage patterns match a specialised vertical
Average	89/100	65/100 (competitor mean)			—
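The Average row can be re-derived from the scoreboard. The sketch below copies the per-vertical scores from the table and assumes an unweighted mean, with the competitor figure pooled across all three competitor columns; the variable names are illustrative.

```python
from statistics import mean

# Per-vertical scores copied from the scoreboard, in row order:
# Medical, Education, Hospitality, Sales, Legal, Beauty, Fitness, Event, Business.
SCORES = {
    "SLAtech":      [94, 91, 89, 88, 92, 87, 86, 90, 84],
    "Intercom Fin": [67, 64, 71, 76, 63, 70, 68, 65, 73],
    "Tidio Lyro":   [58, 55, 68, 61, 54, 72, 66, 57, 70],
    "Chatbase":     [62, 70, 60, 65, 59, 64, 61, 58, 68],
}

slatech_mean = mean(SCORES["SLAtech"])

# Assumption: the competitor mean pools all 27 competitor cells with equal weight.
competitor_scores = [
    score
    for platform, row in SCORES.items()
    if platform != "SLAtech"
    for score in row
]
competitor_mean = mean(competitor_scores)

print(slatech_mean, round(competitor_mean, 1))  # 89 and 64.6, which rounds to 65
```

Each competitor has the same number of rows, so pooling all cells gives the same result as averaging the three per-platform means.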

Methodology per vertical

How each score was built

Medical — 94/100

200 questions spanning booking, pre-treatment instructions, specialty triage, and edge-case clinical-advice traps. Scored on factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1).

Education — 91/100

200 questions: admissions intake (60), course Q&A (80), Quiz Mode regression (40), parent-comms (20). Scored on the same rubric.

Hospitality — 89/100

200 questions: room availability, dietary requests, late-arrival logistics, neighbourhood recommendations, group bookings. Same rubric.

Sales — 88/100

200 questions: lead qualification, objection handling, demo booking, CRM-aware routing. Same rubric.

Legal — 92/100

200 questions: intake triage, fee structure, conflict-check prompts, substantive-advice traps. UPL safety weighted 0.4.

Beauty — 87/100

200 questions: gel-vs-acrylic, balayage pricing, brow-lamination prep, no-show recovery flows. Same rubric.

Fitness — 86/100

200 questions: class booking, plan comparison, trial waivers, programme retrieval. Per-member privacy weighted.

Event — 90/100

200 questions: RSVP flows, venue capacity, dietary capture, multi-language guest mix. Same rubric.

Business — 84/100

200 questions: generic SMB intake, multi-industry overlap, fallback routing. Generic baseline rubric.

Methodology FAQ

How we scored, why we publish, when next

200 industry-specific questions per vertical, scored on factuality (0.4) + grounding (0.3) + tone (0.2) + safety (0.1). Each platform ran the same question set with identical retrieval budgets (TopK=10, ScoreThreshold=0.5). Regulated verticals (Medical, Legal) weight safety higher. Snapshot: June 2026.
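One way to read the retrieval budget, as a hedged sketch: drop retrieved chunks scoring below the threshold, then keep at most the ten best. The `Hit` type, the `apply_budget` helper, the inclusive threshold, and the filter-then-truncate order are all assumptions for illustration; only the two numeric values come from the methodology, and each platform's real retrieval API will differ.

```python
from dataclasses import dataclass

TOP_K = 10             # TopK from the methodology
SCORE_THRESHOLD = 0.5  # ScoreThreshold from the methodology


@dataclass(frozen=True)
class Hit:
    """A retrieved chunk (hypothetical shape)."""
    doc_id: str
    score: float  # assumed: higher means more relevant


def apply_budget(hits: list[Hit]) -> list[Hit]:
    """Keep hits at or above the threshold, best first, capped at TOP_K."""
    kept = [h for h in hits if h.score >= SCORE_THRESHOLD]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:TOP_K]


# Synthetic hits with scores 0.00, 0.05, ..., 1.00 (not eval data).
hits = [Hit(f"doc-{i}", i / 20) for i in range(21)]
selected = apply_budget(hits)
print(len(selected), selected[0].doc_id, selected[-1].doc_id)  # 10 doc-20 doc-11
```

Holding both values fixed across platforms means no bot sees more context than another for the same question.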

Internal vertical-team review (clinical lead for Medical, attorney-of-counsel for Legal, hospitality consultant for hotels), calibrated to a shared rubric. LLM-as-Judge cross-check applied to 100% of answers, with human override on disagreement. Raw transcripts are available on request from [email protected].
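The review flow above could be sketched as follows. What counts as a "disagreement" is not specified here, so the 10-point tolerance is a hypothetical placeholder, as are the `Review` type, the field names, and the choice to keep the reviewer's score when the two agree.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tolerance in points on the 0-100 scale; not stated in the methodology.
DISAGREEMENT_TOLERANCE = 10.0


@dataclass
class Review:
    reviewer_score: float                   # vertical-team reviewer, 0-100
    judge_score: float                      # LLM-as-Judge cross-check, 0-100
    override_score: Optional[float] = None  # recorded by a human on disagreement


def needs_override(review: Review) -> bool:
    """Flag answers where reviewer and judge differ by more than the tolerance."""
    return abs(review.reviewer_score - review.judge_score) > DISAGREEMENT_TOLERANCE


def final_score(review: Review) -> float:
    """Assumed policy: keep the reviewer's score unless a disagreement was flagged."""
    if not needs_override(review):
        return review.reviewer_score
    if review.override_score is None:
        raise ValueError("disagreement flagged but no human override recorded")
    return review.override_score


print(final_score(Review(reviewer_score=90, judge_score=85)))                     # 90
print(final_score(Review(reviewer_score=90, judge_score=60, override_score=75)))  # 75
```

Raising an error when an override is missing keeps a flagged answer from slipping into the totals unreviewed.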

Buyers saw only marketing claims before signup and hit reality after wiring three tools in parallel. A public eval cuts the evaluation cycle from weeks to hours: the buyer reads the methodology, agrees or disagrees, and moves to a 14-day trial with calibrated expectations. The transparency is itself a competitive signal.

Yes, every internal eval is. We publish the rubric and the raw transcripts so external readers can audit. The rubric matches public industry standards (factuality + grounding + tone + safety); identical question sets run against each competitor; LLM-as-Judge cross-check at 100% coverage. Disagree? Email [email protected] with the transcript ID and we'll publish a correction.

Quarterly. The Q3 2026 run is scheduled for September; results publish here within 7 days of completion. Competitors are also re-scored each quarter; significant pricing or capability changes between runs are noted with a footnote and a "verified since" date in the per-vertical row.

Request the raw transcripts

Every score is auditable. Email the founder with the vertical you want to review.
