12 Types of Performance Review Bias to Watch For (With Engineering Examples & Fixes)
The short version. Most lists of review biases name them and stop there. This one does three things competitors don't: (1) goes to 12, not 10; (2) gives every bias a realistic engineering-team example you've probably lived through; (3) ends with a single cross-cutting antidote that actually works — evidence-grounding plus structure. The matrix at the bottom is the bit to bookmark.
By Samira Bahmanyar · HR Manager
Last updated 2026-05-19 · Field guide for engineering managers and people leaders.
Why naming the bias is the first fix
Bias in performance reviews isn't a character flaw. It's the predictable output of a brain asked to summarise six months of someone's work from imperfect memory under time pressure. Every bias on this list is a shortcut your brain takes when it doesn't have the evidence in front of it.
That's the through-line. You can't will yourself out of these — but you can design the review process so the shortcuts aren't needed. We'll get to that. First, the 12.
For the deeper "how do I fix all of these at once" answer, see our companion piece on how to reduce bias in performance reviews.
The 12 biases
1. Recency bias
Definition. You over-weight what happened in the last 4–6 weeks of a review period and under-weight the first five months.
Engineering example. Priya led the migration off the legacy auth service in February — a six-month effort that quietly de-risked the whole platform. In late June, an unrelated incident she was on-call for took two extra hours to resolve. The review opens with "needs to work on incident response."
Fix. Pull a chronological evidence log for the entire period before writing — PRs merged, tickets closed, incidents owned, channel threads led — and weight by impact, not recency. This one bias is so common it gets its own deep-dive: recency bias in performance reviews.
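One way to make "weight by impact, not recency" concrete is to lay the evidence log out by month before drafting. The sketch below is a minimal illustration, not a prescribed tool: the `Evidence` record, the 1-to-5 `impact` scale, and the sample entries (including the year) are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    when: date     # date the work landed
    kind: str      # e.g. "pr", "ticket", "incident", "thread"
    summary: str
    impact: int    # hypothetical 1 (minor) to 5 (major) score set by the manager


def impact_by_month(log: list[Evidence]) -> dict[str, int]:
    """Sum impact scores per calendar month, in chronological order."""
    totals: dict[str, int] = defaultdict(int)
    for item in sorted(log, key=lambda e: e.when):
        totals[item.when.strftime("%Y-%m")] += item.impact
    return dict(totals)


def top_items(log: list[Evidence], n: int = 3) -> list[Evidence]:
    """Return the n highest-impact items, regardless of when they happened."""
    return sorted(log, key=lambda e: e.impact, reverse=True)[:n]


# Made-up sample entries loosely modeled on the example above.
log = [
    Evidence(date(2025, 6, 24), "incident", "On-call incident ran two hours long", 1),
    Evidence(date(2025, 2, 14), "pr", "Cut over from the legacy auth service", 5),
    Evidence(date(2025, 4, 3), "thread", "Led the schema review discussion", 2),
]

print(impact_by_month(log))
print([e.summary for e in top_items(log, n=1)])
```

If the review's opening paragraph is about a month that barely registers in `impact_by_month`, that is a hint the draft is following recency rather than impact.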
2. Primacy bias
Definition. The opposite of recency. You over-weight first impressions and early work, and discount everything after.
Engineering example. A new senior engineer joined in January, had a rocky first 30 days adapting to your stack, then shipped three major features cleanly. The review still leads with "ramp-up was slow."
Fix. Same evidence log as recency, plus an explicit check: would I write this review the same way if the first month had been swapped with the last?
3. Halo and horns effect
Definition. One strong trait makes everything else look strong (halo). One weakness makes everything look weak (horns).
Engineering example. Marcus is an exceptional systems thinker. His design docs are legendary. His PR reviews are also slow and his junior mentorship is inconsistent — but the review rates him "exceeds expectations" across all five competencies because the design-doc halo bleeds.
Fix. Rate each competency in isolation with its own evidence. Force yourself to cite a different artifact for each one. If you can't cite different evidence, you're rating the halo, not the competency.
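The "different artifact for each competency" rule is easy to check mechanically. Here is a rough sketch, assuming you keep a simple mapping from competency to the artifacts cited for it; the competency names and artifact IDs are hypothetical.

```python
def shared_evidence(citations: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each artifact cited under more than one competency to those competencies."""
    seen: dict[str, list[str]] = {}
    for competency, artifacts in citations.items():
        for artifact in set(artifacts):  # ignore repeats within one competency
            seen.setdefault(artifact, []).append(competency)
    return {a: comps for a, comps in seen.items() if len(comps) > 1}


def uncited(citations: dict[str, list[str]]) -> list[str]:
    """List competencies that were rated without any cited artifact."""
    return [competency for competency, artifacts in citations.items() if not artifacts]


# Hypothetical draft: one design doc is carrying two competencies.
citations = {
    "system design": ["design-doc-event-schemas"],
    "code review": ["design-doc-event-schemas"],
    "mentorship": [],
}

print(shared_evidence(citations))
print(uncited(citations))
```

Anything returned by either function is a competency worth re-rating on its own evidence.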
4. Leniency and strictness bias
Definition. A manager rates nearly everyone at the top of the scale (leniency) or withholds top ratings from nearly everyone (strictness), regardless of actual performance.
Engineering example. Engineering Manager A's team all get "exceeds" — she dislikes giving hard feedback. Engineering Manager B's team all get "meets" — he believes "exceeds" must be earned through heroics. Two engineers shipping identical work get different ratings depending on which org they sit in.
Fix. Calibration sessions across managers, anchored to specific artifacts. "Show me the PR or design doc that earned the 'exceeds'" is the question that re-anchors both extremes.
5. Central tendency bias
Definition. Defaulting everyone to the middle of the scale because the middle feels safest.
Engineering example. Out of eight reports, seven get "meets expectations" on every dimension. The eighth ships the platform-wide observability rewrite — also "meets." The rating distribution is statistically improbable.
Fix. Forced distribution isn't the answer (it has its own problems), but a rule of thumb works: if a single rating accounts for more than ~60% of the ratings you give across all reports and competencies, audit the evidence. You're probably defaulting.
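The ~60% rule of thumb can be checked in a few lines. This is a minimal sketch under the assumption that you can flatten every rating you gave (all reports, all competencies) into one list; the function name and the sample numbers are made up for illustration.

```python
from collections import Counter


def dominant_rating(ratings: list[str], threshold: float = 0.60):
    """Return (rating, share) if one rating exceeds the threshold, else None."""
    if not ratings:
        return None
    rating, count = Counter(ratings).most_common(1)[0]
    share = count / len(ratings)
    return (rating, share) if share > threshold else None


# Hypothetical manager: 8 reports x 5 competencies = 40 ratings.
ratings = ["meets"] * 30 + ["exceeds"] * 8 + ["below"] * 2

print(dominant_rating(ratings))
```

A non-`None` result doesn't prove central tendency; it only says the distribution is lopsided enough that the evidence behind it deserves a second look.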
6. Similarity and affinity bias
Definition. You rate people who are more like you (same school, same background, same communication style, same hobbies) more favorably.
Engineering example. Both Dan and Aisha shipped roughly equivalent work this half. Dan followed a bootcamp-to-staff-engineer path similar to yours. The review of Dan reads warmer, leans into "growth trajectory," and projects forward; the review of Aisha sticks to deliverables and reads cooler. Same evidence, different framing.
Manager script. Before drafting: "If I swap the names on these two reviews, does the framing still make sense?" If not, the warmer one is doing affinity work.
Fix. Write reviews evidence-first, then add framing last. Have a peer reviewer compare any two reviews side-by-side for tone parity.
7. Gender bias
Definition. Systematic differences in language and rating that track to gender, not performance. Women get more personality feedback ("abrasive", "supportive"); men get more competency feedback ("strategic", "technical"). Women's accomplishments get framed as luck or team effort; men's get framed as individual capability.
Engineering example. Two staff engineers led equivalent migrations. The man's review: "drove the migration, made the key architectural call on event schemas." The woman's review: "was a great collaborator on the migration; the team came together under her facilitation."
Fix. Screen drafts for gendered framing patterns specifically — attribution, agency verbs ("drove" vs. "supported"), personality-vs-competency ratio. Tools can do this in seconds; humans miss it. PerfCopilot's bias check flags gendered language automatically, but a manual swap-the-name test catches the worst of it too.
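To show what a language screen counts, here is a deliberately tiny sketch. It is not how PerfCopilot or any other product works, and the word lists are illustrative assumptions seeded from the examples above, not a validated lexicon; a real screen would need a much larger, researched vocabulary.

```python
import re

# Toy word lists: assumptions for illustration only.
PERSONALITY_WORDS = {"abrasive", "supportive", "collaborator"}
COMPETENCY_WORDS = {"strategic", "technical", "architectural"}
HIGH_AGENCY_TERMS = {"drove", "led", "built"}
LOW_AGENCY_TERMS = {"supported", "helped", "assisted", "facilitation"}


def screen(draft: str) -> dict[str, int]:
    """Count how many words in the draft fall into each framing category."""
    words = re.findall(r"[a-z']+", draft.lower())
    return {
        "personality": sum(w in PERSONALITY_WORDS for w in words),
        "competency": sum(w in COMPETENCY_WORDS for w in words),
        "high_agency": sum(w in HIGH_AGENCY_TERMS for w in words),
        "low_agency": sum(w in LOW_AGENCY_TERMS for w in words),
    }


his = "Drove the migration, made the key architectural call on event schemas."
hers = ("Was a great collaborator on the migration; "
        "the team came together under her facilitation.")

print(screen(his))
print(screen(hers))
```

Comparing the two dictionaries side by side is the automated cousin of the swap-the-name test: equivalent work should produce roughly similar counts.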
8. Tenure bias
Definition. Long-tenured employees get rated more favorably (or sometimes less favorably, in "stale" framings) than their actual work warrants. New hires get rated on potential; veterans on history.
Engineering example. A 7-year veteran has been coasting on platform knowledge for two halves; the review still rates "exceeds" because "she's the institutional memory." A 9-month senior hire shipped three high-leverage projects; the review rates "meets" because "still learning the stack."
Fix. Rate the period, not the resume. Evidence from this half only. If the long-tenured engineer's "exceeds" is supported entirely by work from years ago, it's a tenure rating, not a performance one.
9. Idiosyncratic rater bias
Definition. Different managers have systematically different internal scales. Manager A's "exceeds" = Manager B's "meets." Research on multi-rater data suggests more than half the variance in ratings reflects the rater, not the ratee (Cappelli & Tavis, The Performance Management Revolution, HBR 2016).
Engineering example. When the same engineer transfers from Team A to Team B mid-year, their rating drops a full tier — even though peers and stakeholders rate them identically.
Fix. Calibration. Multiple raters. Anchored rubrics with example artifacts at each level ("here's what an 'exceeds' design doc looks like"). The single-manager rating is the noisiest data point in the entire HR stack.
10. Contrast bias
Definition. You rate someone relative to the person you just reviewed, not against the standard.
Engineering example. You finish reviewing your strongest staff engineer and immediately start drafting the review of a solid mid-level. The mid-level reads as underwhelming in the comparison, even though their work meets every expectation for their level.
Fix. Don't draft reviews back-to-back. Write each one against the written rubric for that level, not against the previous review's tone.
11. Confirmation bias
Definition. You form an early impression of the period (good half / bad half) and then unconsciously gather evidence that confirms it while filtering out evidence that doesn't.
Engineering example. You've decided Sam had "a rough half" after a missed deadline in April. Sam also shipped the rate-limiter rewrite, mentored two juniors through promotion, and led the on-call rotation cleanup. None of those appear in the draft — they don't fit the story.
Manager script. Force yourself to write a one-paragraph counter-case: "What would the review look like if I started from the opposite premise?" If you can write that paragraph credibly, your original framing was confirmation, not assessment.
12. Spillover bias
Definition. Last period's rating influences this period's rating, independent of this period's actual work.
Engineering example. An engineer got "exceeds" last half on a high-visibility launch. This half she did solid but unremarkable maintenance work. The review rates "exceeds" again because "she's an exceeds-level engineer," not because this period earned it.
Fix. Don't look at last period's rating until after this period's draft is written. Rate the half, not the career arc.
The one antidote that cuts across all 12
If you read this list and felt the panic of "how do I track all of this in every review I write" — good news. There's a single move that addresses most of these biases at once: ground every claim in a specific artifact from the review period, and structure the review the same way every time.
Evidence-grounding kills recency, primacy, halo, horns, leniency, strictness, central tendency, tenure, confirmation, and spillover all at once, because each of those biases is your brain filling in a gap that evidence from the period would have filled. Structure (the same competency rubric, the same template, the same calibration step) kills idiosyncratic rater, contrast, and central tendency.
That leaves similarity and gender bias as the two that evidence alone doesn't fix — those need a separate screen on language and framing, not just on what's cited. That's why the best bias-reduction stacks pair the two: evidence + a language screen.
For the full implementation playbook, see how to reduce bias in performance reviews and the pillar guide to performance review software.
Bias → countermeasure matrix
| # | Bias | One-line countermeasure |
|---|---|---|
| 1 | Recency | Pull a full-period evidence log before drafting |
| 2 | Primacy | Same, and ask "would I write this if month 1 and month 6 were swapped?" |
| 3 | Halo / horns | Cite a different artifact for each competency |
| 4 | Leniency / strictness | Calibration session anchored to specific artifacts |
| 5 | Central tendency | Audit your distribution; if >60% land in one rating, re-check evidence |
| 6 | Similarity / affinity | Swap-the-name test on any two reviews before submitting |
| 7 | Gender | Automated language screen + agency-verb audit |
| 8 | Tenure | Rate the period only; evidence must come from this half |
| 9 | Idiosyncratic rater | Multi-rater input + anchored rubric with example artifacts |
| 10 | Contrast | Don't draft reviews back-to-back; rate against the rubric, not the previous review |
| 11 | Confirmation | Write a counter-case paragraph from the opposite premise |
| 12 | Spillover | Hide last period's rating until this period's draft is written |
This matrix is the citable summary — copy it, print it, paste it into your team's review SOP.
Where software helps (one honest line)
Most of these fixes are workflow changes, not tool purchases. The two that genuinely benefit from automation are evidence-gathering across the full period and language screening for gender/tenure/recency framing. PerfCopilot does both — pulls the period's actual PRs, tickets, and threads, then screens the draft for gendered language, tenure framing, and recency over-weighting. Free for teams up to 5; Pro $4.99/user/month, billed annually.
Frequently asked questions
What are the most common types of bias in performance reviews?
Recency bias, the halo effect, leniency (with its mirror image, strictness), similarity bias, and gender bias are the five most documented in HR research and the five most likely to appear in any given review draft. The other seven on this list (primacy, central tendency, tenure, idiosyncratic rater, contrast, confirmation, spillover) are real but less frequent.
Which performance review bias is hardest to fix?
Idiosyncratic rater bias — the systematic difference between managers' internal scales — is the hardest because it's invisible to the individual rater. The only real fix is multi-rater input and calibration, which most small teams skip. A single-manager rating is the noisiest signal in your HR data.
How do you reduce gender bias in performance reviews specifically?
Two moves in sequence. First, write the review evidence-first — cite a specific PR, ticket, or thread for every claim — to remove the gaps where gendered framing slips in. Second, run a language screen on the draft (manual swap-the-name test, or automated) to catch attribution, agency-verb, and personality-vs-competency patterns. Doing only one of the two is not enough.
Does AI bias-checking software actually work for performance reviews?
For the language-pattern biases (gender, tenure framing, recency over-weighting), yes — automated screens catch patterns that human reviewers consistently miss in their own drafts. For the evidence-driven biases (halo, leniency, confirmation), software helps mainly by surfacing the evidence so the human can spot the gap. The tool isn't the judge; it's the spell-checker.
What is the most overlooked bias on this list?
Spillover bias. Most managers don't realize they're rating the career arc instead of the period. The simple fix — don't look at last period's rating until this period's draft is written — feels artificial until you try it and notice how often it changes your conclusion.
Field guide maintained by the PerfCopilot team. We build review-writing software that pulls real shipped work into cited, bias-screened drafts. If you want to skip the manual evidence-gathering step, start free — no credit card, up to 5 seats.