Ch 8 — Evaluating AI Vendors

How to filter signal from noise, run a real evaluation, and negotiate from strength
Chapter flow: Discovery → Demo → Diligence → Negotiate → Pilot → Review
The Vendor Landscape
500+ vendors claiming AI — how to filter signal from noise
The Market Reality
The HR AI market is exploding. There are now 500+ vendors claiming to use AI in some form — from resume screening to workforce planning to benefits optimization. The problem? Most are repackaging basic automation or wrapping OpenAI APIs with a thin UI layer on top. The difference between a vendor that built genuine AI and one that calls ChatGPT behind the scenes matters enormously for reliability, data privacy, and long-term value.
Why This Is Hard
Unlike buying an HRIS or ATS where features are visible and testable, AI capabilities are opaque by nature. You can’t look at a demo and tell whether the system is running a sophisticated custom model or a prompt to GPT-4. You need a different evaluation playbook — and that’s what this chapter provides.
The Three Tiers of HR AI Vendors
Tier 1: AI-Native
- Built their own models, trained on domain data
- Have data scientists on staff
- Can explain their architecture in detail
- Examples: purpose-built ML platforms

Tier 2: AI-Augmented
- Core product isn't AI, but added AI features
- Often uses third-party AI (OpenAI, etc.)
- AI is a feature, not the foundation
- Examples: HRIS adding "AI insights"

Tier 3: AI-Marketed
- Slapped "AI" on existing automation
- Keyword matching rebranded as "AI screening"
- Rules engines called "intelligent automation"
- Examples: most legacy vendors' "AI updates"
Ops instinct: Tier doesn’t equal quality. A well-built Tier 2 tool that wraps GPT with proper guardrails may outperform a Tier 1 vendor with a poorly trained model. The tier tells you what questions to ask, not which to buy.
Red Flags in Vendor Pitches
What vendors say vs. what they mean — and what to look for instead
Translating Vendor Claims
“Our AI is 99% accurate.” → Accuracy on what dataset? Measured how? 99% accuracy on an easy benchmark means nothing. Ask for false negative rates by demographic group.

“It eliminates bias.” → No AI eliminates bias. Good AI can be audited for bias, and the bias it finds can be mitigated. Any vendor claiming elimination is either naive or dishonest.

“It works out of the box.” → Means it hasn’t been trained on your data, your culture, or your definitions of success. Generic models produce generic results.

“We use proprietary AI.” → Could mean anything from a custom-trained model to a prompt template they wrote for ChatGPT. Ask specifically: is it a custom model, a fine-tuned model, or an API wrapper?
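The "ask for false negative rates by demographic group" advice above is easy to operationalize. Here is a minimal sketch of that calculation; the record fields and sample data are hypothetical, and a real audit would use your own labeled outcomes.

```python
# Sketch: false negative rate by group, to pressure-test a vendor's
# "99% accurate" claim. A false negative here is a qualified candidate
# the AI rejected. Field names and sample data are hypothetical.
from collections import defaultdict

def false_negative_rates(records):
    """records: iterable of dicts with 'group', 'actual' (True = qualified),
    and 'predicted' (True = AI advanced the candidate)."""
    misses = defaultdict(int)   # qualified candidates the AI rejected
    totals = defaultdict(int)   # qualified candidates per group
    for r in records:
        if r["actual"]:
            totals[r["group"]] += 1
            if not r["predicted"]:
                misses[r["group"]] += 1
    return {g: misses[g] / totals[g] for g in totals}

sample = [
    {"group": "A", "actual": True, "predicted": True},
    {"group": "A", "actual": True, "predicted": True},
    {"group": "B", "actual": True, "predicted": False},
    {"group": "B", "actual": True, "predicted": True},
]
print(false_negative_rates(sample))  # {'A': 0.0, 'B': 0.5}
```

A headline accuracy number can look excellent while one group's false negative rate is several times another's; this per-group breakdown is what a real bias audit reports.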
Red Flags vs. Green Flags
Red Flags
Won’t share model documentation. “It’s proprietary.”
No bias audit results. “Our AI is fair by design.”
Vague accuracy claims. “Industry-leading performance.”
No customer references in your industry.
Pressure to sign before piloting.
Can’t explain what the AI actually does.
Green Flags
Publishes a model card with training data, metrics, and limitations.
Shares bias audit results broken down by protected class.
Offers a pilot period with defined success criteria.
Explains the human oversight model.
Can articulate what the AI does and doesn’t do.
Has a clear data processing agreement.
The test: Ask the vendor: “Can you show me your model card and your most recent bias audit?” Their response tells you everything. Confident transparency = good partner. Deflection = proceed with caution.
The Demo Trap
Demos are designed to impress, not inform — here’s how to flip the script
Why Demos Are Misleading
Every AI vendor demo uses curated datasets, best-case scenarios, and hand-picked examples. The resume the AI screens perfectly? They chose it. The chatbot answer that’s flawless? They tested that question 100 times. The sentiment analysis that nails it? Pre-selected data. This isn’t dishonest — it’s marketing. But it means what you see in a demo has almost no correlation with real-world performance.
The Curated Demo Problem
Think of it like interviewing a candidate who only answers questions they prepared for. They sound brilliant. But can they handle unexpected situations? Edge cases? Messy real-world data? A demo tells you what the product can do at its best. You need to know what it does at its worst.
Questions That Reveal Real Capabilities
During the demo, ask:
- "Can I submit my own test data right now?" If no: why not? What are they hiding?
- "Show me what happens when the AI is wrong." How does the system handle errors? Is there a confidence score? A fallback?
- "Run this edge case I brought." Bring a resume in a weird format, a non-standard job title, a bilingual document.
- "Show me the audit trail for that decision." Can you see why the AI scored a candidate the way it did?
- "What does the admin experience look like?" Demos show the happy path. You need the configuration and maintenance view.
Pro tip: Before any demo, prepare 5-10 test cases from your own data. Include edge cases: a resume with a career gap, a non-English name, an unconventional career path. How the AI handles your data matters more than how it handles theirs.
Due Diligence Checklist
Technical, business, and legal diligence — the complete framework
Technical Diligence
Model documentation: Does the vendor publish a model card? Can they explain what type of AI it is (custom ML, fine-tuned LLM, API wrapper)?

Bias audit results: Have they conducted a bias audit? By whom? What were the results by protected class?

Data practices: Where is your data stored? Is it used to train their models? Who has access? How is it encrypted?
Business Diligence
Customer references: Can they provide 3+ references in your industry and company size?

Implementation timeline: Realistic timeline with milestones, not just “4-6 weeks.”

Support model: Dedicated CSM? Response time SLAs? Escalation path?
Legal Diligence
Data Processing Agreement: must cover purpose limitation, sub-processors, data residency, breach notification, and deletion.

Indemnification: who's liable if the AI makes a discriminatory decision? The vendor should share risk.

Exit Terms: can you export your data? In what format? What's the transition timeline? What happens to your data after termination?

Compliance Commitments: SOC 2? GDPR? NYC Local Law 144? Illinois AIPA? Colorado AI Act? Will they conduct annual bias audits?

IP & Confidentiality: your employee data is your IP. Is it segregated from other customers? Is it used to improve their product?
Ops move: Create a standard vendor evaluation scorecard before you start looking. Having the criteria defined upfront prevents “demo dazzle” — where the flashiest demo wins over the most solid product.
Data Ownership & Portability
Who owns what, and what happens when you leave
The Questions That Matter
Who owns the data you put in? You should retain full ownership of all employee data, candidate data, and any content you create within the platform. This sounds obvious but must be explicit in the contract.

Can you export your data if you leave? In what format? CSV? API? Proprietary format that requires their tools to read? Data portability is your insurance policy.

Does the vendor use your data to train their models? Many AI vendors improve their models using customer data. If your employee data is training a model that serves your competitors, that’s a problem.

What happens after contract termination? Is your data deleted? Within what timeframe? Can you verify deletion? What about backups?
Critical Contract Terms
Data Ownership Clause: "Customer retains all rights, title, and interest in Customer Data." Non-negotiable. Walk if they push back.

Training Opt-Out: "Vendor shall not use Customer Data to train, improve, or develop any models." Or at minimum: explicit opt-in only.

Export Rights: "Customer may export all data in standard machine-readable formats at any time." "Standard" means CSV or JSON, not a proprietary format.

Post-Termination: "All Customer Data deleted within 30 days of termination, with written certification." Include backups in the deletion scope.
The leverage: Data ownership terms are easier to negotiate early. Once your data is in their system and your team depends on the tool, switching costs give the vendor leverage. Get these terms right at signing, not renewal.
Pricing Models & Hidden Costs
How to calculate true total cost of ownership
Common Pricing Models
Per-seat: Pay per user or admin. Simple but can get expensive at scale. Watch for “user” definitions — is it admins only or all employees?

Per-transaction: Pay per resume screened, per chatbot conversation, per analysis run. Costs scale with usage, which can be unpredictable.

Per-model: Pay for specific AI capabilities. Want to add sentiment analysis? That’s another module, another fee.

Platform fee + usage: Base platform fee plus per-unit usage charges. Most common for AI-native tools. Gives predictable baseline with variable upside.
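The four models above can produce very different annual bills for the same workload. A quick sketch makes the comparison concrete; every price and volume here is an illustrative placeholder, not a benchmark.

```python
# Sketch: annual cost under three of the pricing models above for one
# hypothetical team. All prices and volumes are made up for illustration.
seats, resumes_per_year = 40, 24_000

per_seat        = seats * 1_200                      # $1,200 per seat per year
per_transaction = resumes_per_year * 2.50            # $2.50 per resume screened
platform_plus   = 30_000 + resumes_per_year * 1.00   # base fee + $1 per unit

print(per_seat, per_transaction, platform_plus)      # 48000 60000.0 54000.0
```

Run the same arithmetic with your own volumes before the pricing call: the cheapest model depends entirely on how your usage scales, and per-transaction pricing in particular can swing with hiring cycles.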
Hidden Cost Calculator
Visible Costs
License fee: $_____/yr

Hidden Costs (ask about each)
Implementation: $_____
Data migration: $_____
Integration development: $_____
Admin training: $_____
End-user training: $_____
Customization: $_____
Ongoing support tier: $_____/yr
Bias audit (if separate): $_____/yr
Additional modules: $_____
Overage charges: $_____

True TCO = Visible + Hidden. Typical hidden costs run 1.5-3x the license fee in Year 1 and 0.5-1x in subsequent years.
Ops instinct: Always ask: “What does a customer in our size range spend in Year 1, all-in?” Then: “What does Year 2 look like?” If they can’t or won’t answer, the hidden costs are significant.
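The calculator above is simple addition, but writing it out keeps the hidden line items from being forgotten. A minimal sketch, with every dollar figure an illustrative placeholder:

```python
# Sketch: Year-1 total cost of ownership from the hidden cost calculator.
# All figures are illustrative placeholders, not pricing benchmarks.
visible = {"license": 60_000}
hidden = {
    "implementation": 25_000,
    "data_migration": 10_000,
    "integration_dev": 20_000,
    "admin_training": 5_000,
    "end_user_training": 8_000,
    "customization": 12_000,
    "support_tier": 15_000,
    "bias_audit": 10_000,
    "additional_modules": 0,
    "overage_charges": 5_000,
}
tco = sum(visible.values()) + sum(hidden.values())
hidden_multiple = sum(hidden.values()) / sum(visible.values())
print(f"Year-1 TCO: ${tco:,} (hidden costs = {hidden_multiple:.1f}x license)")
# Year-1 TCO: $170,000 (hidden costs = 1.8x license)
```

Note that even these invented numbers land inside the 1.5-3x Year-1 range; a vendor quote that shows only the license line is hiding most of the bill.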
Negotiation Leverage Points
What you can negotiate and what gives you leverage
What You Can Negotiate
Pilot terms: Free or reduced-cost pilot period with defined success criteria. 30-90 days is standard. Push for no commitment to buy if the pilot fails.

Bias audit requirements: Require the vendor to conduct and share annual bias audits at their expense, with results broken down by protected class.

SLA commitments: Uptime guarantees, response time for support tickets, maximum latency for AI predictions. Tie SLAs to financial penalties.

Data deletion guarantees: Specific timelines and written certification of deletion upon termination.

Price protection: Cap annual price increases at 3-5%. Multi-year deals should lock rates.

Exit terms: Data export assistance, transition support period, reasonable notice requirements.
What Gives You Leverage
Competing vendors: always evaluate 2-3 vendors in parallel, and let each vendor know you're comparing. Competition drives better terms.

Regulatory requirements: "We need bias audits because NYC Local Law 144 requires them." Vendors can't push back on legal obligations.

Data volume: large datasets are valuable to vendors. If you have 10K+ employees, your data volume is a negotiating asset.

Timing: end of quarter means the vendor needs to close. Best pricing happens in Q4 or at fiscal year-end. Don't rush; let them come to you.

Reference willingness: offering to be a case study or reference customer is worth 10-15% in discounts. Only offer if you'd actually recommend them.
Negotiation truth: Vendors expect you to negotiate. The first price is never the best price. If they say “this is our standard pricing,” that’s the opening position, not the final offer. Every term is negotiable until you sign.
The Pilot Framework
How to run a proper pilot that actually tells you what you need to know
What a Proper Pilot Looks Like
Defined success metrics: Before the pilot starts, agree on exactly what success looks like. Accuracy rate? Time saved? User satisfaction score? Adverse impact ratio? Write it down.

Time-bound: 30-90 days. Long enough to get real data, short enough to maintain urgency. Set milestone check-ins at 2 and 4 weeks.

Controlled comparison: Run the AI process alongside your current process. Compare outcomes directly. Don’t just measure the AI in isolation.

Representative data: Use real data that reflects your actual workforce and candidate pool. Don’t cherry-pick easy cases.
Pilot Scorecard Template
PILOT EVALUATION FRAMEWORK

Accuracy (weight: 25%)
- Does the AI match human decisions?
- False positive rate: ___
- False negative rate: ___

Bias (weight: 25%)
- Four-fifths rule compliance: ___
- Disparate impact by group: ___

User Experience (weight: 20%)
- Admin satisfaction: ___/10
- End-user satisfaction: ___/10
- Time saved per task: ___

Integration (weight: 15%)
- Data sync reliability: ___
- Latency: ___ ms average
- Errors/week: ___

Support (weight: 15%)
- Response time: ___ hrs average
- Issue resolution rate: ___%
- Escalations needed: ___

GO/NO-GO THRESHOLD: weighted score ≥ 7/10
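The scorecard template reduces to a weighted average plus a four-fifths rule check. A minimal sketch of both, assuming 0-10 scores per category; the weights mirror the template, while the example scores and selection rates are purely illustrative.

```python
# Sketch: weighted GO/NO-GO score from the pilot scorecard, plus a
# four-fifths rule check. Example inputs are illustrative, not real data.
WEIGHTS = {"accuracy": 0.25, "bias": 0.25, "ux": 0.20,
           "integration": 0.15, "support": 0.15}

def pilot_score(scores, threshold=7.0):
    """scores: {category: 0-10 rating}. Returns (weighted score, go?)."""
    total = sum(scores[k] * w for k, w in WEIGHTS.items())
    return total, total >= threshold

def four_fifths_ok(selection_rates):
    """selection_rates: {group: fraction selected}. Passes if every group's
    rate is at least 80% of the highest group's rate."""
    top = max(selection_rates.values())
    return all(rate / top >= 0.8 for rate in selection_rates.values())

score, go = pilot_score({"accuracy": 8, "bias": 7, "ux": 9,
                         "integration": 6, "support": 7})
print(score, go)                               # 7.5 True
print(four_fifths_ok({"A": 0.40, "B": 0.30}))  # False (0.30/0.40 = 0.75)
```

Keeping the bias check separate from the weighted score is deliberate: a tool that fails the four-fifths rule should be a no-go even if its overall score clears the 7/10 threshold.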
The golden rule: Never commit to a full contract without a pilot. Any vendor who resists a pilot is a vendor who knows their product won’t hold up under real conditions. A confident vendor welcomes the chance to prove their value with your actual data.