Ch 5 — AI for Workforce Planning

The models, methods, and data architecture behind workforce intelligence — explained for operations leaders
Under the Hood
The workforce AI pipeline: Collect → Clean → Model → Validate → Deploy → Retrain
What Attrition Models Actually Measure
Survival analysis, logistic regression, random forests — and the features that drive predictions
The Model Types
Attrition prediction isn’t one model — it’s a family of approaches, each with tradeoffs:

Survival analysis: Models the time until an event (departure). Answers “when is this person most likely to leave?” rather than just “will they leave?” Handles employees who haven’t left yet (censored data) properly.

Logistic regression: The workhorse. Predicts a probability of departure (0–100%) based on input features. Highly interpretable — you can see exactly which factors drive the score. Preferred when explainability matters.

Random forests / gradient boosting: Ensemble methods that combine hundreds of decision trees. Higher accuracy than logistic regression but harder to explain. Common in vendor platforms that prioritize predictive power.
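The logistic-regression approach can be sketched in a few lines. The coefficients below are invented for illustration (no real model produced them); the point is the interpretability the text describes: each feature's contribution to the risk score is simply coefficient × value, so you can see exactly why an employee was flagged.

```python
import math

# Hypothetical coefficients a trained model might produce (illustrative only)
COEFFS = {
    "comp_ratio": -3.1,           # paid below market -> higher risk
    "manager_changes_18mo": 0.6,  # each change adds risk
    "tenure_years": -0.2,         # longer tenure -> slightly lower risk
}
INTERCEPT = 1.5

def attrition_probability(features):
    """Logistic regression: p = 1 / (1 + e^-(intercept + sum(coef * x)))."""
    z = INTERCEPT + sum(COEFFS[k] * features[k] for k in COEFFS)
    return 1 / (1 + math.exp(-z))

underpaid = {"comp_ratio": 0.85, "manager_changes_18mo": 3, "tenure_years": 1.5}
well_paid = {"comp_ratio": 1.05, "manager_changes_18mo": 0, "tenure_years": 4.0}
print(round(attrition_probability(underpaid), 2))  # noticeably higher risk
print(round(attrition_probability(well_paid), 2))  # lower risk
```

Because the score decomposes into per-feature contributions, a manager-facing tool can say "this score is driven mainly by comp ratio" — the explainability advantage the vendor-platform ensembles give up.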
Feature Importance
Feature Importance Ranking (typical model)

  1. Comp ratio           0.23   // salary vs. market
  2. Tenure bucket        0.19   // 18mo & 3yr peaks
  3. Manager changes      0.14   // count in 18mo window
  4. Skip-level distance  0.11   // layers to exec sponsor
  5. Perf trajectory      0.09   // rating trend (up/flat/down)
  6. Peer departure rate  0.08   // team attrition last 6mo
  7. Promotion velocity   0.07   // time since last promo
  8. Engagement delta     0.05   // survey score change
  9. Commute distance     0.04   // for hybrid/on-site roles

  // Importance scores sum to 1.0
  // Higher = more influence on prediction
Why each matters: Comp ratio captures market competitiveness. Tenure buckets reflect known departure windows. Manager changes signal instability. Skip-level distance measures visibility and sponsorship. The combination is more predictive than any single factor.
The Feature Engineering Challenge
How raw HRIS data becomes useful model inputs — and why this matters more than model choice
Raw Data vs. Features
Your HRIS stores raw facts: hire date, manager ID, salary, performance rating. But models don’t consume raw facts effectively. Feature engineering transforms raw data into meaningful signals:

“Employee had manager IDs 4521, 3892, and 7104 over the last 18 months” becomes the feature “3 manager changes in 18 months.”

“Employee earned $95K in 2024 and the market midpoint for their role is $105K” becomes the feature “comp ratio: 0.90.”

This transformation step is where domain expertise matters most. An HR ops person who knows that the third manager change is qualitatively different from the first brings knowledge no data scientist has.
Feature Engineering Patterns
Time-windowed features
  manager_changes_18mo = 3
  peer_departures_6mo  = 2
  engagement_delta_yoy = -0.8
  // Window size matters: 6mo vs 12mo vs 18mo
  // captures different dynamics

Ratio features
  comp_ratio    = salary / market_midpoint
  promo_ratio   = time_since_promo / avg_for_level
  team_turnover = departures / team_size
  // Ratios normalize across different scales

Interaction effects
  high_performer_AND_low_comp    = HIGH RISK
  new_manager_AND_low_engagement = HIGH RISK
  recent_promo_AND_high_comp     = LOW RISK
  // Combinations are more predictive
  // than individual features alone
The insight: Data scientists often say “feature engineering matters more than model choice.” A simple logistic regression with expertly crafted features often outperforms a complex neural network with raw inputs. This is where HR ops knowledge directly improves model quality.
Model Validation for HR
Why accuracy alone is misleading and how to properly evaluate attrition models
The Accuracy Trap
If your annual attrition rate is 15%, a model that predicts “nobody will leave” is 85% accurate. That’s useless. Accuracy is misleading when classes are imbalanced — and attrition is almost always imbalanced (most people stay). What you actually need:

Precision: Of the people the model flags as flight risks, how many actually leave? Low precision = too many false alarms, managers stop trusting the system.

Recall: Of the people who actually leave, how many did the model flag? Low recall = missing the departures that matter most.

There’s always a tradeoff. Higher recall catches more true departures but also raises more false alarms. The right balance depends on the cost of each type of error.
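Precision and recall fall out of simple set arithmetic over who was flagged and who actually left. The numbers below are made up for illustration:

```python
def precision_recall(flagged, actual_leavers):
    """Both arguments are sets of employee IDs."""
    true_pos = len(flagged & actual_leavers)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual_leavers) if actual_leavers else 0.0
    return precision, recall

# Model flagged 20 people; 12 of them left, and 8 more leavers went unflagged
flagged = set(range(1, 21))   # IDs 1-20 flagged as flight risks
left = set(range(9, 29))      # IDs 9-28 actually departed
p, r = precision_recall(flagged, left)
print(f"precision={p:.2f}  recall={r:.2f}")
```

A 0.60 precision means 4 in 10 alerts are false alarms; a 0.60 recall means 4 in 10 departures were missed. Which failure costs more is a business call, not a modeling one.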
Validation Methods
Train/test split
  Train on 80% of data, test on 20%
  // Basic. Risk: random split may leak info

Cross-validation
  Split data into 5 folds, train on 4, test on 1
  Rotate and average results
  // More robust. Standard practice.

Temporal validation  ← REQUIRED for HR
  Train on 2024 data, test on 2025 data
  // The ONLY valid approach for time-series
  // predictions like attrition. If a vendor
  // doesn't do temporal validation, their
  // reported accuracy is inflated.

Why it matters:
  Random split accuracy:        87%
  Temporal validation accuracy: 71%
  // The gap tells you how much the model
  // was memorizing vs. truly predicting
Vendor question: “Do you use temporal validation?” If a vendor reports accuracy from a random train/test split on attrition prediction, their numbers are inflated. Temporal validation — training on past data, testing on future data — is the only honest approach for time-dependent predictions.
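The mechanics of a temporal split are trivial, which makes skipping it inexcusable. A minimal sketch, assuming each training example is a dated snapshot of an employee's features plus the later-observed outcome (the record shape here is illustrative):

```python
from datetime import date

def temporal_split(snapshots, cutoff):
    """Train strictly on observations before the cutoff; test on or after it.
    Each snapshot: (observation_date, features_dict, left_within_year)."""
    train = [s for s in snapshots if s[0] < cutoff]
    test = [s for s in snapshots if s[0] >= cutoff]
    return train, test

snapshots = [
    (date(2024, 1, 1), {"comp_ratio": 0.90}, True),
    (date(2024, 7, 1), {"comp_ratio": 1.10}, False),
    (date(2025, 1, 1), {"comp_ratio": 0.80}, True),
]
train, test = temporal_split(snapshots, date(2025, 1, 1))
print(len(train), len(test))  # 2 1
```

Contrast this with a random split, where a 2025 snapshot of an employee can land in training while their 2024 snapshot lands in test — letting the model "predict" the past from the future.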
Skills Taxonomies and Ontologies
O*NET, ESCO, Lightcast, and how AI maps free text to structured skills
The Major Frameworks
O*NET: US Department of Labor taxonomy. 1,000+ occupations, each with detailed skill requirements, knowledge areas, and ability ratings. Free to use. Strong for established US roles. Slow to update for emerging skills.

ESCO: European Skills, Competences, Qualifications, and Occupations. 3,000+ occupations, 13,000+ skills. Multi-language. Broader than O*NET but less granular for technical roles.

Lightcast (formerly Emsi Burning Glass): Commercial taxonomy built from billions of job postings. Updates frequently. Strong on emerging skills. Expensive. The de facto standard for HR tech vendors.

Custom taxonomies: Company-specific skill lists. Perfectly aligned with your business but expensive to maintain and hard to benchmark externally.
How AI Maps Text to Skills
Input: Job description free text
  "Seeking a data analyst proficient in SQL,
  Python, and Tableau to support our finance
  team's forecasting and reporting needs."

NLP extraction:
  sql         → matched to SQL (Programming)
  python      → matched to Python (Programming)
  tableau     → matched to Data Visualization
  forecasting → matched to Financial Modeling
  reporting   → matched to Business Reporting

Challenges:
  "Excel"         = basic spreadsheets or VBA/macros?
  "leadership"    = managing people or leading projects?
  "AI experience" = using ChatGPT or building models?
  // Ambiguity is the norm, not the exception
The decay problem: Skills have a half-life. “Flash development” is obsolete. “Prompt engineering” didn’t exist 4 years ago. Any skills taxonomy needs a process for deprecating old skills and adding emerging ones — and most organizations don’t have that process.
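The simplest form of text-to-skill mapping is dictionary lookup against a taxonomy. The toy taxonomy below is invented; real systems map to O*NET, ESCO, or Lightcast identifiers and use embeddings or contextual NLP precisely because keyword matching cannot resolve the ambiguities listed above:

```python
import re

# Toy taxonomy fragment (illustrative); production systems hold thousands
# of entries keyed to a maintained framework such as O*NET or Lightcast
TAXONOMY = {
    "sql": "SQL (Programming)",
    "python": "Python (Programming)",
    "tableau": "Data Visualization",
    "forecasting": "Financial Modeling",
}

def extract_skills(text):
    """Naive keyword match: lowercase tokens looked up in the taxonomy."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(TAXONOMY[t] for t in tokens if t in TAXONOMY)

jd = ("Seeking a data analyst proficient in SQL, Python, and Tableau "
      "to support our finance team's forecasting needs.")
print(extract_skills(jd))
```

Even this baseline surfaces the maintenance burden: every new skill ("prompt engineering") and every deprecated one ("Flash development") is a taxonomy edit someone must own.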
Scenario Simulation Architecture
Monte Carlo simulation in plain language — running thousands of “what if” scenarios
Monte Carlo in Plain Language
Monte Carlo simulation is a method for understanding uncertainty by running thousands of scenarios with randomized inputs. Here’s the intuition:

Instead of saying “we’ll lose 15 people next year,” you say: “Each employee has a probability of leaving. Let’s simulate next year 10,000 times, each time randomly determining who leaves based on their individual probabilities.”

After 10,000 simulations, you don’t get one number — you get a distribution of outcomes. “There’s a 50% chance we lose 12–18 people, a 25% chance we lose 19–25, and a 5% chance we lose more than 30.” That distribution is far more useful for planning than a single point estimate.
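The whole method fits in a few lines. This sketch assumes each employee already has an individual annual departure probability (the probabilities below are invented for a hypothetical 100-person org); each run flips a weighted coin per employee and counts departures, and the percentiles are read off the sorted results:

```python
import random

def simulate_departures(leave_probs, runs=10_000, seed=7):
    """One run = flip a weighted coin per employee, count total departures."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in leave_probs) for _ in range(runs)]

# Hypothetical per-employee annual attrition probabilities, 100-person org
probs = [0.05] * 60 + [0.20] * 30 + [0.50] * 10
results = sorted(simulate_departures(probs))
p10, p50, p90 = (results[int(len(results) * q)] for q in (0.10, 0.50, 0.90))
print(f"P10={p10}  P50={p50}  P90={p90}")
```

The expected value here is 14 departures, but the simulation's point is the spread around it — the P10/P90 band is what turns a headcount number into a plan with contingencies.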
Interpreting Confidence Intervals
10,000 Simulation Runs
// Engineering headcount, next 12 months

P10 (optimistic):  Net +12 engineers
  // Only 10% of simulations were better
P50 (median):      Net +4 engineers
  // Half of simulations above, half below
P90 (pessimistic): Net -6 engineers
  // Only 10% of simulations were worse

Key drivers of variance:
  Attrition rate:   12%–28% range
  Hiring velocity:  4–8 weeks avg time-to-fill
  Offer acceptance: 65%–85% rate

// Plan for P50 but budget for P75
// Have a contingency plan for P90
The shift: Monte Carlo changes the conversation from “here’s our headcount plan” to “here’s the probability distribution of outcomes.” Leadership can then decide which percentile to plan for based on risk appetite — a much more honest and resilient approach.
Pay Equity Analysis Methodology
Multivariate regression, legitimate factors, and the gap between statistical and legal standards
How It Works
Pay equity analysis uses multivariate regression to isolate unexplained pay differences. The process:

1. Define the outcome: Total compensation (or base salary, or total cash comp — each tells a different story)

2. Control for legitimate factors: Job level, job family, tenure, location, performance rating, relevant experience. These are factors that should drive pay differences.

3. Test for demographic differences: After controlling for legitimate factors, is there a statistically significant difference in pay by gender, race, age, or other protected class?

4. Assess significance: A 1.5% gap might be statistically significant in a large population but not in a small one. Both practical and statistical significance matter.
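A real analysis fits a multivariate regression with significance tests; the stratified comparison below is a deliberately simplified stand-in that shows what "controlling for a legitimate factor" does to a gap. All records are fabricated toy data (salaries in $K), and only one control (job level) is used:

```python
from collections import defaultdict
from statistics import mean

# Toy records: (gender, job_level, salary_in_k) — fabricated for illustration
records = [
    ("F", 1, 60), ("M", 1, 62), ("F", 1, 61), ("M", 1, 61),
    ("F", 2, 80), ("M", 2, 84), ("M", 2, 86), ("F", 2, 81),
    ("M", 3, 110), ("M", 3, 112), ("F", 3, 105),
]

def raw_gap(records):
    """Unadjusted M-F mean salary difference across the whole population."""
    by_gender = defaultdict(list)
    for g, _, s in records:
        by_gender[g].append(s)
    return mean(by_gender["M"]) - mean(by_gender["F"])

def adjusted_gap(records):
    """Within-level M-F gap, weighted by level size — a crude stand-in
    for 'controlling for job level' in a regression."""
    by_level = defaultdict(lambda: defaultdict(list))
    for g, lvl, s in records:
        by_level[lvl][g].append(s)
    gaps, weights = [], []
    for groups in by_level.values():
        if "M" in groups and "F" in groups:
            gaps.append(mean(groups["M"]) - mean(groups["F"]))
            weights.append(len(groups["M"]) + len(groups["F"]))
    return sum(g * w for g, w in zip(gaps, weights)) / sum(weights)

print(round(raw_gap(records), 1), round(adjusted_gap(records), 1))
```

The adjusted gap is smaller than the raw gap because much of the raw difference runs through level — which is exactly why over-controlling is dangerous: if level itself reflects biased promotions, the adjustment hides the problem rather than explaining it.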
Legal vs. Statistical Standards
Statistical standard:
  p-value < 0.05 (95% confidence)
  // "There's less than 5% probability
  // this gap occurred by chance"

Legal standard (US):
  "Similarly situated" employees
  // Courts define comparator groups
  // differently than statisticians

The gap between them:
  Statistical: 3.2% gap, p=0.003
  // Clearly statistically significant
  Legal question: Are these employees
  truly "similarly situated"?
  // Did the model control for all
  // legitimate factors the court would
  // recognize? Did it over-control
  // for factors that are themselves
  // influenced by discrimination?

Warning: Controlling for "job level" can
mask discrimination in promotion decisions.
  // If women are promoted less, controlling
  // for level hides the gap at its source
Critical nuance: Pay equity analysis requires legal counsel, not just data science. Over-controlling (adding too many variables) can mask real discrimination. Under-controlling can show gaps that have legitimate explanations. The right set of control variables is a legal and business judgment, not a purely statistical one.
HRIS Data Quality Scoring
A practical framework for assessing your AI readiness — with scoring criteria
The Five Dimensions
Data quality for workforce AI can be assessed across five dimensions. Each dimension is scored 1–5:

Completeness: What percentage of fields are populated? Are there systematic gaps (e.g., skills data empty for 60% of employees)?

Accuracy: Do the values reflect reality? Are job titles current? Do reporting structures match actual management relationships?

Consistency: Is the same concept represented the same way? “Sr. Analyst,” “Senior Analyst,” and “Analyst III” for the same role break consistency.

Timeliness: How current is the data? Performance ratings from 18 months ago are stale. Salary data should reflect current state.

Uniqueness: Are there duplicate records? Ghost employees? Test accounts in production data?
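The completeness dimension is the easiest to score mechanically. A minimal sketch, assuming employee records arrive as dictionaries (the field names and sample data are illustrative):

```python
def completeness(records, fields):
    """Fraction of (record, field) cells that are actually populated."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields
                 if r.get(f) not in (None, ""))
    return filled / total if total else 0.0

# Toy extract: note the systematic gap in the skills field
employees = [
    {"title": "Senior Analyst", "skills": "SQL;Python", "manager_id": 17},
    {"title": "Analyst III", "skills": "", "manager_id": 17},
    {"title": "Sr. Analyst", "skills": None, "manager_id": None},
]
print(round(completeness(employees, ["title", "skills", "manager_id"]), 2))
```

Accuracy, consistency, and timeliness need human judgment or reference data to score, which is why the full audit takes days rather than minutes — but even a completeness pass per field often reveals the systematic gaps (like skills data) before a vendor conversation starts.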
Scoring Template
HRIS Data Quality Scorecard
// Score each dimension 1-5 per data domain

               Comp  Accr  Cons  Time  Uniq  AVG
Job titles       3     2     2     4     4   3.0
Reporting        4     3     4     3     5   3.8
Compensation     5     4     4     5     5   4.6
Skills           2     2     1     1     3   1.8
Performance      4     3     4     4     5   4.0
Demographics     5     5     5     5     5   5.0

AI Readiness Thresholds:
  4.0+     = Ready for predictive models
  3.0–3.9  = Descriptive analytics only
  <3.0     = Fix data before investing in AI
Start here: Run this scorecard before any AI procurement conversation. It takes 2–3 days of audit work and gives you an honest picture of readiness. Share the results with vendors — serious vendors will adjust their implementation plan accordingly. Unserious vendors will ignore it.
Building Your Workforce Intelligence Stack
Architecture for connecting HRIS, ATS, LMS, and external data — build vs. buy decisions
The Architecture Layers
Layer 1 — Source systems: HRIS (Workday, BambooHR), ATS (Greenhouse, Lever), LMS (Cornerstone, LinkedIn Learning), payroll, engagement surveys, external market data (Lightcast, Radford, Mercer).

Layer 2 — ETL / Integration: Extract data from source systems, transform it into consistent formats, load it into a central store. This is where job title standardization, ID reconciliation, and data cleaning happen.

Layer 3 — Data warehouse: A single source of truth that combines data from all sources. Snowflake, BigQuery, Databricks, or even a well-structured SQL database.

Layer 4 — Analytics / ML: Dashboards, predictive models, and scenario simulations that consume the clean, integrated data.

Layer 5 — Consumption: How stakeholders access insights — dashboards, automated alerts, embedded in HRIS workflows, Slack/Teams notifications.
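Job title standardization, mentioned in Layer 2, is a representative ETL transform. This is a sketch under assumptions: the abbreviation table and the company-specific alias map are invented examples of the rules such a pipeline would maintain, not a standard library:

```python
import re

# Hypothetical normalization rules maintained in the ETL layer
CANONICAL = {"sr": "senior", "jr": "junior", "iii": "3", "ii": "2"}

# Hypothetical company-specific equivalences ("Analyst III" = "Senior Analyst")
ALIAS = {"analyst 3": "senior analyst"}

def standardize_title(raw):
    """Lowercase, expand abbreviations, then apply role aliases."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    norm = " ".join(CANONICAL.get(t, t) for t in tokens)
    return ALIAS.get(norm, norm)

for t in ("Sr. Analyst", "Senior Analyst", "ANALYST  III"):
    print(standardize_title(t))  # all three map to the same canonical title
```

Rules like these are exactly what breaks the "Sr. Analyst" / "Senior Analyst" / "Analyst III" inconsistency before it reaches the warehouse — and they need an owner, because titles drift every reorg.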
Build vs. Buy by Component
Source systems: BUY
  // Don't build an HRIS. Use Workday,
  // BambooHR, Rippling, etc.

ETL / Integration: BUY or HYBRID
  // Tools: Fivetran, dbt, Workato
  // Custom transforms still needed

Data warehouse: BUY
  // Snowflake, BigQuery, Databricks
  // Don't build your own database

Analytics / ML: HYBRID
  // Buy dashboards (Tableau, Looker)
  // Build custom models for attrition,
  // skills gaps, scenario simulation
  // OR buy (Visier, One Model, Crunchr)

People analytics platform: BUY vs BUILD
  BUY (Visier, One Model):
    Faster to deploy, standard models,
    less customization
  BUILD (warehouse + custom):
    More flexible, full data ownership,
    needs data team
The decision framework: Buy when the capability is standard (dashboards, data warehouses). Build when the capability is a competitive advantage or requires deep customization (custom attrition models, company-specific skills ontologies). Most organizations should start by buying a platform and build custom only where the platform falls short.