Ch 8 — Evaluating AI Vendors

Architecture, security, compliance, and integration — the technical evaluation deep dive
Evaluating Model Architecture
Custom model, fine-tuned, or API wrapper — and why it matters
The Architecture Question
The single most important technical question: is this a custom model or a wrapper around GPT/Claude? Neither is inherently better, but the answer determines everything about reliability, data privacy, and long-term viability. A custom model means the vendor controls the stack. A wrapper means they’re dependent on OpenAI or Anthropic — if those providers change pricing, terms, or capabilities, your vendor is affected.
Why Architecture Matters for HR
Data privacy: If the vendor wraps GPT, your employee data may pass through OpenAI’s servers. Ask explicitly.

Reliability: Custom models can be hosted in your cloud or the vendor’s. API wrappers are subject to third-party outages. When OpenAI goes down, does your screening tool go down?

Consistency: API providers update models without notice. A prompt that worked yesterday may behave differently today. Custom models only change when the vendor retrains.
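Because API providers can update models without notice, some teams run a "canary" check: a fixed set of prompts sent on a schedule, with responses compared against a recorded baseline. A minimal sketch, assuming a deterministic (temperature-0) vendor endpoint; `call_vendor_api` is a placeholder for whatever client the vendor actually provides:

```python
# Hypothetical drift canary: replay fixed prompts and flag any response that
# no longer matches the recorded baseline. Exact-match fingerprints only make
# sense for deterministic outputs; for sampled outputs, compare the parsed
# classification or score instead.
import hashlib

CANARY_PROMPTS = [
    "Classify this resume summary: '10 years of Java backend experience.'",
    "Classify this resume summary: 'Recent bootcamp graduate, no work history.'",
]

def fingerprint(text: str) -> str:
    """Stable hash of a model response, for cheap change detection."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()[:16]

def detect_drift(call_vendor_api, baseline: dict) -> list:
    """Return the canary prompts whose responses no longer match baseline."""
    drifted = []
    for prompt in CANARY_PROMPTS:
        if fingerprint(call_vendor_api(prompt)) != baseline.get(prompt):
            drifted.append(prompt)
    return drifted
```

If the canary ever fires on a vendor who claims a pinned model version, that is a conversation worth having before the next contract renewal.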
Architecture Assessment
Ask the vendor: "Is your model custom-built, fine-tuned, or prompt-engineered on top of a foundation model?"

If custom-built:
- What architecture? (Transformer, etc.)
- What was it trained on?
- How often do you retrain?
- Where is it hosted?

If fine-tuned:
- Which base model? (GPT-4, Claude, etc.)
- What data was used for fine-tuning?
- Do you control the base model version?

If prompt-engineered (API wrapper):
- Which API provider?
- Does employee data pass to the provider?
- What happens during provider outages? What's your fallback?
- Are you on a fixed model version?
Technical reality: Most HR AI vendors in 2026 use API wrappers or fine-tuned models, not custom-built architectures. That’s not automatically bad — but you need to understand the data flow. If your employee PII routes through a third-party foundation model provider, your DPA needs to cover that chain.
Security Assessment
What your security team should ask — and a questionnaire template
Security Baseline
SOC 2 Type II: The table stakes certification. Type I means they designed controls; Type II means an auditor verified they actually follow those controls over time. Accept only Type II.

Penetration testing: Annual third-party pen tests with remediation evidence. Not just “we do security testing” — ask for the most recent report summary.

Encryption: AES-256 at rest, TLS 1.2+ in transit. No exceptions. Ask specifically about encryption of AI training data and model weights.

Access controls: Role-based access, MFA for all admin accounts, principle of least privilege. How many vendor employees can access your data?
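The "TLS 1.2+ in transit" requirement is one of the few items above you can verify yourself before the questionnaire even goes out. A minimal sketch using the Python standard library; the vendor hostname is a placeholder:

```python
# Verify a vendor endpoint negotiates TLS 1.2 or better. A context with
# minimum_version set will refuse the handshake entirely on TLS 1.0/1.1.
import socket
import ssl

def make_strict_tls_context() -> ssl.SSLContext:
    """Default-secure context (cert + hostname checks) refusing < TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx

def check_tls(host: str, port: int = 443) -> str:
    """Connect and return the negotiated protocol version, e.g. 'TLSv1.3'."""
    ctx = make_strict_tls_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()

# check_tls("api.vendor.example")  # placeholder host; raises ssl.SSLError
                                   # if the server only offers TLS 1.0/1.1
```

Encryption at rest and of training data, by contrast, can't be probed from outside — those answers have to come from the questionnaire and the SOC 2 report.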
Security Questionnaire Template
INFRASTRUCTURE
1. Where is data hosted? (Region, provider)
2. SOC 2 Type II certification? (Date)
3. Last penetration test? (Date, firm)
4. Encryption at rest? (Standard)
5. Encryption in transit? (Protocol)

ACCESS CONTROLS
6. MFA for all access? (Y/N)
7. # of employees with data access?
8. Background checks on staff? (Y/N)
9. Customer data segregation method?

AI-SPECIFIC
10. Does data pass to third-party AI APIs?
11. Are prompts/responses logged?
12. Is customer data used for training?
13. Model versioning and rollback?

INCIDENT RESPONSE
14. Breach notification timeline?
15. Incident response plan? (Share it)
16. Breach history in last 3 years?
AI-specific risk: Traditional security questionnaires miss AI-specific concerns. Questions 10-13 above are critical and often absent from standard vendor security reviews. Add them to your template — most vendors haven’t been asked these yet, and their answers will be revealing.
The Model Card Deep Dive
How to read a model card and what missing information tells you
What a Model Card Should Contain
A model card is a standardized document describing an AI model’s capabilities, limitations, and appropriate use. Think of it like a drug’s prescribing information — it tells you what it does, what it doesn’t, and where it might cause harm.

Intended use: What the model was designed for (and explicitly what it should not be used for).
Training data: What data the model learned from, including demographics and representativeness.
Performance metrics: Accuracy, precision, recall — ideally broken down by demographic group.
Limitations: Known failure modes, edge cases, and scenarios where performance degrades.
Ethical considerations: Bias assessment results, fairness metrics, and mitigation steps taken.
Model Card Analysis Example
EXAMPLE: Resume Screening Model Card

Intended Use: Rank candidates for roles in technology and finance sectors
  Missing: Does it work for healthcare? Retail? If your industry isn't listed, ask why.

Training Data: 2M resumes, 2019-2024; 60% tech, 30% finance, 10% other
  Red flag: Heavy tech/finance skew. Performance on other industries = unknown.

Performance: 88% agreement with human reviewers on top-tier classification
  Missing: Breakdown by gender, race, age. 88% overall could mask 70% for some groups.

Limitations: "Reduced accuracy for non-English resumes"
  Good: At least they disclosed it. Ask how many of your candidates this affects.
The rule: What’s missing from a model card is as important as what’s there. No demographic breakdown of performance? The vendor either hasn’t tested for bias or doesn’t like what they found. No intended use limitations? They haven’t thought carefully about where the model fails.
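The masking effect is plain weighted-average arithmetic. The group sizes and accuracies below are invented for illustration, but the mechanism is exactly how an impressive headline number hides a weak subgroup:

```python
# Illustrative only: a strong overall accuracy hiding a weak subgroup.
# Group sizes and accuracies are invented for the example.
groups = {
    "group_a": {"n": 800, "accuracy": 0.925},  # majority of the test set
    "group_b": {"n": 200, "accuracy": 0.70},   # underrepresented group
}

total = sum(g["n"] for g in groups.values())
overall = sum(g["n"] * g["accuracy"] for g in groups.values()) / total
print(f"overall accuracy: {overall:.0%}")  # prints "overall accuracy: 88%"
```

An 88% headline and a 70% subgroup coexist comfortably whenever the weak group is a small share of the test set, which is precisely why the demographic breakdown belongs on the model card.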
Integration Architecture Evaluation
API-first vs. embedded vs. hybrid — what actually integrates with your stack
Integration Patterns
API-first: The vendor provides APIs you call from your existing systems. Maximum flexibility but requires development resources. Best for: organizations with engineering support who want the AI embedded in existing workflows.

Embedded/standalone: The AI lives in the vendor’s UI. Your team logs into their platform separately. Easiest to deploy but creates another silo. Best for: teams without engineering resources who accept a separate workflow.

Hybrid: API access plus a standalone interface. The vendor provides both. Best for: most organizations — use the UI for quick tasks, the API for automated workflows.
Data Sync Patterns
Real-time (webhook): Changes in your HRIS push instantly to the AI tool. Ideal but complex.
Batch (scheduled): Data syncs on a schedule (hourly, daily). Simpler but data can be stale.
On-demand (API pull): The AI queries your system when needed. Flexible but adds latency.
Integration Evaluation Checklist
API QUALITY
- REST or GraphQL? ________
- API documentation quality: ___/10
- Rate limits: ________
- Versioning strategy: ________
- Sandbox/test environment: Y/N

AUTHENTICATION
- SSO support (SAML/OIDC): Y/N
- SCIM provisioning: Y/N
- API key management: ________

YOUR STACK COMPATIBILITY
- HRIS connector: ________
- ATS connector: ________
- Payroll connector: ________
- Custom webhook support: Y/N

RELIABILITY
- SLA uptime guarantee: ____%
- Average API latency: ____ms
- Error handling approach: ________
- Retry/backoff strategy: ________
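When you ask about the retry/backoff strategy, the answer you want to hear is some variant of exponential backoff with jitter. A minimal sketch of the pattern, where `call` stands in for any vendor API call that may fail transiently:

```python
# Exponential backoff with full jitter: sleep a random amount up to
# base * 2^attempt (capped), so retries from many clients don't synchronize.
import random
import time

def with_backoff(call, retries: int = 4, base: float = 0.5, cap: float = 8.0):
    """Retry `call` on exception; re-raise once retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

A vendor whose recommended client retries immediately in a tight loop, or not at all, is telling you something about how they expect to behave during an outage.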
Integration truth: The #1 reason AI tool deployments fail is integration, not the AI itself. A vendor may have a brilliant model, but if it can’t talk to your HRIS reliably, it’s useless. Test the integration during your pilot, not after you’ve signed.
Performance Benchmarking
How to test vendor claims independently with your own data
Why You Must Benchmark Yourself
Vendor-reported accuracy is measured on their data, their way. Your data is different. Your definitions of “good candidate” or “flight risk” are different. Your workforce demographics are different. The only benchmark that matters is performance on your data, measured by your standards.

This isn’t adversarial — good vendors welcome independent testing. It protects both of you.
Creating a Test Dataset
1. Select representative cases: 200-500 cases from your actual data that span the full range of outcomes.
2. Label them: Have human experts classify each case (hired/not, high performer/not, retained/churned).
3. Include edge cases: Career gaps, industry changers, non-standard formats.
4. Ensure diversity: Your test set must represent all demographic groups in your population.
5. Blind the vendor: Don’t show them the labels. Give them the raw data and compare their outputs to your ground truth.
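Step 4 is the one teams most often skip, and it is checkable. A minimal sketch that compares demographic shares in the test set against the full population; the field name and tolerance are illustrative:

```python
# Representativeness check for step 4: flag any group whose share of the
# test set differs from its share of the population by more than `tolerance`.
from collections import Counter

def share(records: list, field: str) -> dict:
    """Fraction of records per value of `field`."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def representativeness_gaps(test_set, population, field, tolerance=0.05):
    """Groups whose test-set share is off by more than `tolerance`."""
    pop, test = share(population, field), share(test_set, field)
    return {g: round(test.get(g, 0.0) - p, 3)
            for g, p in pop.items()
            if abs(test.get(g, 0.0) - p) > tolerance}
```

An empty result means the sample mirrors the population on that field; a non-empty one tells you exactly which groups to resample before sending anything to the vendor.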
Benchmark Template
PERFORMANCE BENCHMARK TEMPLATE
Test Set: ___ cases from your data
Ground Truth: Human expert labels

Overall Metrics
- Accuracy: ___% (matches / total)
- Precision: ___% (true pos / predicted pos)
- Recall: ___% (true pos / actual pos)
- F1 Score: ___ (harmonic mean of P & R)

Fairness Metrics (by group)
- Selection rate by gender: ___
- Selection rate by race: ___
- Selection rate by age band: ___
- Four-fifths rule pass: Y/N

Statistical Significance
- Sample size sufficient: Y/N (N ≥ 200 for reliable metrics)
- Confidence interval: ___ (95% CI is the standard threshold)

Comparison to Baseline
- Current process accuracy: ___%
- AI improvement: +/- ___%
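The template's overall and fairness metrics are simple arithmetic once you have predictions and ground-truth labels side by side. A minimal sketch, assuming binary labels (1 = selected, 0 = not) and a parallel list of group tags:

```python
# Accuracy, precision, recall, F1, and the four-fifths rule from parallel
# lists of predictions, ground-truth labels, and group memberships.

def core_metrics(preds: list, truth: list) -> dict:
    tp = sum(1 for p, t in zip(preds, truth) if p and t)
    fp = sum(1 for p, t in zip(preds, truth) if p and not t)
    fn = sum(1 for p, t in zip(preds, truth) if not p and t)
    accuracy = sum(1 for p, t in zip(preds, truth) if p == t) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def four_fifths_pass(preds: list, group: list) -> bool:
    """Lowest group selection rate must be >= 80% of the highest."""
    rates = {}
    for g in set(group):
        sel = [p for p, gg in zip(preds, group) if gg == g]
        rates[g] = sum(sel) / len(sel)
    return min(rates.values()) >= 0.8 * max(rates.values())
```

Run the same functions on every vendor's output so the comparison is apples to apples; the selection-rate breakdown generalizes to any group field you carry in the test set.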
Power move: Run the same test dataset through multiple vendors. Standardized benchmarks make comparisons objective. The vendor who resists testing on your data is the vendor who knows their numbers won’t hold up.
Data Processing Agreement Analysis
Key DPA clauses, what they mean, and what to negotiate
Critical DPA Clauses
Purpose limitation: The vendor can only use your data for the specific purpose defined in the agreement. No secondary uses, no model training, no analytics for their own benefit. This clause prevents scope creep.

Sub-processor chains: Who else touches your data? If the vendor uses AWS for hosting and OpenAI for inference, both are sub-processors. You need to know the full chain and have approval rights over changes.

Data residency: Where is your data physically stored? This matters for GDPR (EU data in EU), and increasingly for state privacy laws. “Cloud-hosted” isn’t an answer — you need the region.

Breach notification: How quickly must the vendor notify you of a data breach? GDPR requires 72 hours. Your DPA should match or beat your strictest regulatory requirement.
DPA Clause Checklist
PURPOSE LIMITATION
- Data used only for defined services
- No model training on customer data
- No aggregation across customers

SUB-PROCESSORS
- Full list of sub-processors provided
- Prior notice of sub-processor changes
- Right to object to new sub-processors

DATA SUBJECT RIGHTS
- Process for access requests (DSAR)
- Process for deletion requests
- Reasonable response timeframes

DELETION & RETURN
- Data returned in standard formats
- Deletion within 30 days of termination
- Written certification of deletion
- Backup deletion included

BREACH NOTIFICATION
- Notification within 48-72 hours
- Details: scope, impact, remediation
- Cooperation with your incident response
Non-negotiable: If a vendor’s DPA doesn’t address these items, they either haven’t thought about data protection seriously or they don’t want to commit to it. Either way, that’s a disqualifying signal for any tool that will process employee PII.
Exit Strategy Planning
Plan the exit before you sign — data portability, migration, and dependency risk
Why Plan the Exit Now
The AI vendor landscape is volatile. Companies get acquired, pivot, shut down, or change pricing dramatically. Your exit plan is your insurance policy. Negotiating exit terms after you’re dependent on a tool is like negotiating a prenup after the wedding — your leverage is gone.

Plan for three exit scenarios: voluntary switch (you find something better), vendor failure (they go under or get acquired), and contract dispute (terms change unacceptably).
Key Exit Questions
Data export: Can you export all data in standard formats (CSV, JSON) at any time, not just at termination?

Transition timeline: How long does the vendor support the transition? 30 days? 90 days? Is there a fee?

Knowledge transfer: If the AI made decisions or scored candidates, are those scores/decisions exportable and interpretable without the vendor’s tool?

Model dependency: Can your processes function without the AI tool? What’s the manual fallback?
Migration Risk Assessment
EXIT RISK SCORECARD

Data Portability Risk: ___/5
- Export formats available?
- Export includes all data + metadata?
- Can export at any time (not just exit)?

Process Dependency Risk: ___/5
- Can workflows run without the AI?
- Is there a manual fallback documented?
- How many processes depend on it?

Institutional Knowledge Risk: ___/5
- Are AI decisions/scores documented?
- Can you explain past decisions without access to the vendor's tool?

Transition Cost Risk: ___/5
- Estimated migration effort (weeks)?
- Vendor transition support included?
- Overlap period needed?

TOTAL RISK: ___/20
- 1-5: Low risk. Standard transition.
- 6-10: Moderate. Plan 60-90 day overlap.
- 11-15: High. Negotiate strong exit terms.
- 16-20: Critical. Reconsider the vendor.
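The scorecard's total and bands reduce to a small function, which is handy when you are scoring several vendors in one sitting. A minimal sketch using the thresholds from the scorecard; the category keys are illustrative:

```python
# Sum the four 1-5 category scores and map the total to the scorecard's
# risk bands.
def exit_risk_band(scores: dict) -> tuple:
    """scores: category name -> 1-5 risk score. Returns (total, band)."""
    total = sum(scores.values())
    if total <= 5:
        band = "Low risk. Standard transition."
    elif total <= 10:
        band = "Moderate. Plan 60-90 day overlap."
    elif total <= 15:
        band = "High. Negotiate strong exit terms."
    else:
        band = "Critical. Reconsider the vendor."
    return total, band
```

For example, scores of 2/3/4/3 across the four categories total 12, landing in the "High" band before any single category looks alarming on its own.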
The prediction problem: What happens to predictions the old AI system made? If a candidate was scored “85% match” by a tool you no longer have, can you explain that score in an audit? Ensure you can export not just data, but decision audit trails in a format that survives the vendor relationship.
Vendor Scorecard
A weighted scoring framework for objective vendor comparison
Scoring Methodology
Use this weighted scorecard to compare vendors objectively. Each category is scored 1-10 by relevant stakeholders, then multiplied by the category weight. The weights reflect what matters most for HR AI tools — functionality and data practices outweigh pricing because a cheap tool that creates compliance risk is the most expensive tool you can buy.

Who scores: Assemble a cross-functional team. HR Ops scores functionality and support. Security scores data practices. Legal scores compliance. IT scores integration. Finance scores pricing. Aggregate the scores.
How to Use This Scorecard
1. Score each vendor independently before comparing
2. Require written justification for any score below 5 or above 8
3. Disqualify any vendor scoring below 4 in data practices or compliance
4. Calculate weighted totals and rank
5. Use the scores as input to the decision, not the entire decision — qualitative factors matter too
Weighted Vendor Scorecard Template
VENDOR SCORECARD
Vendor: ________________    Date: ________

Functionality (weight: 25%)
- Core AI capability: ___/10
- Accuracy on your data: ___/10
- Feature completeness: ___/10
Category avg: ___ x 0.25 = ___

Data Practices (weight: 20%)
- Data ownership terms: ___/10
- Training data transparency: ___/10
- Export & portability: ___/10
Category avg: ___ x 0.20 = ___

Compliance (weight: 20%)
- Bias audit availability: ___/10
- Regulatory readiness: ___/10
- SOC 2 / security posture: ___/10
Category avg: ___ x 0.20 = ___

Integration (weight: 15%)
- API quality & docs: ___/10
- HRIS/ATS compatibility: ___/10
- SSO/SCIM support: ___/10
Category avg: ___ x 0.15 = ___

Pricing (weight: 10%)
- Total cost of ownership: ___/10
- Pricing transparency: ___/10
Category avg: ___ x 0.10 = ___

Support (weight: 10%)
- Response time & quality: ___/10
- Implementation support: ___/10
Category avg: ___ x 0.10 = ___

WEIGHTED TOTAL: ___/10
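The weighted total and the disqualification rule from the "How to Use" list are easy to compute once each evaluator's item scores are collected. A minimal sketch using the scorecard's weights; category keys are illustrative:

```python
# Average each category's 1-10 item scores, multiply by the category weight,
# and sum. Weights come straight from the scorecard above.
WEIGHTS = {"functionality": 0.25, "data_practices": 0.20, "compliance": 0.20,
           "integration": 0.15, "pricing": 0.10, "support": 0.10}

def weighted_total(item_scores: dict) -> float:
    """item_scores: category name -> list of 1-10 item scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    return sum(WEIGHTS[cat] * (sum(items) / len(items))
               for cat, items in item_scores.items())

def disqualified(item_scores: dict) -> bool:
    """Rule: a category average below 4 in data practices or compliance
    disqualifies the vendor regardless of the weighted total."""
    return any(sum(item_scores[c]) / len(item_scores[c]) < 4
               for c in ("data_practices", "compliance"))
```

Running the disqualification check before computing totals keeps a cheap, well-supported tool with weak data practices from ever looking competitive on paper.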
Next step: This scorecard is your starting point, not your final answer. After scoring, hold a calibration session where the evaluation team discusses scores, debates disagreements, and reaches consensus. The conversation is often more valuable than the numbers.