Summary (TL;DR)
Standard technical interviews are broken. They filter on anxiety and on the ability to verbalize reasoning, not on real skill. The widely quoted "cost of a bad hire" statistic cannot be traced to a source; actual research shows that avoiding a toxic worker returns roughly twice as much as hiring a star. Whiteboard interviews penalize anxious candidates, and pair programming interviews lack the shared context that real collaboration depends on. MBTI and growth mindset screening are ineffective; the Big Five personality model is better supported. The Dreyfus model explains why expert candidates fail interviews: they think intuitively, not step by step, while competent-level interviewers evaluate through explicit rules. Similarity bias and the Dunning-Kruger effect compound this. Google's internal data found no relationship between individual interview scores and job performance. The FATE metrics (Feedback, Accuracy, Time, Effort) help evaluate the process itself. Business scenario interviews that test domain knowledge beat abstract algorithm puzzles.
Technical Interviews Reject the Wrong Engineers
20 years of observation, 50 years of research, and a framework for measuring the interview instead of the candidate
5 days ago
Most companies treat hiring like a filter. Put candidates through enough rounds, ask enough questions, and the good ones will survive. The problem is that the filter is broken. It selects for the wrong things, rejects people it can’t evaluate, and costs more when it fails than most teams realize.
I’ve spent over 15 years observing and researching technical interviews. I’ve watched brilliant engineers get rejected because they didn’t solve a problem the way the interviewer expected. I’ve watched mediocre engineers get hired because they had practiced the right LeetCode patterns. And I’ve watched companies repeat this cycle while citing a “cost of a bad hire” statistic that, as it turns out, no one can trace to an actual source.
This post is about what the research says, where the common tools break down, and a framework I built to measure interview quality itself.
The cost of getting it wrong is… wrong?
You’ve probably seen the claim that the U.S. Department of Labor estimates a bad hire costs 30% of first-year earnings. I went looking for the original source. There isn’t one. No DOL publication, no report title, no URL. Each article cites an earlier one, and the chain never reaches a primary source. The same is true of the “80% of turnover is due to bad hiring decisions” figure attributed to Harvard Business Review. No specific HBR article contains that number.
This is not surprising. It's the same effect as the Learning Pyramid: a number of trusted sites repeating something they heard somewhere else, with no reference to the original empirical sources.
The real research is less dramatic but more useful.
The Center for American Progress reviewed 30 case studies in 2012 and found the median replacement cost across all positions is about 21% of annual salary [1]. For workers earning under $75K, that number holds. For senior roles, it climbs to 213% [1].
But replacement cost is the wrong frame for engineering teams. The more useful finding comes from Housman and Minor’s 2015 Harvard Business School study of 50,000 workers across 11 firms [2]. They found that avoiding a toxic worker generates roughly twice the return of hiring a star performer. A toxic worker costs about $12,489 in direct replacement. A top-1% performer adds about $5,303 in value [2]. And toxic behavior spreads. When one joins a team, peers become more likely to behave the same way.
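The "roughly twice" claim can be checked directly from the two dollar figures quoted above. The snippet below is only a sketch of that arithmetic; the constant and function names are made up for illustration and do not come from the study.

```python
# Dollar figures as quoted from Housman and Minor (2015) in the text above.
# The names below are illustrative, not taken from the paper.
AVOIDED_TOXIC_COST = 12_489  # cost avoided by not hiring one toxic worker
STAR_VALUE_ADDED = 5_303     # value added by one top-1% performer


def return_ratio(avoided_cost: float, value_added: float) -> float:
    """How many times larger the avoided cost is than the value added."""
    return avoided_cost / value_added


ratio = return_ratio(AVOIDED_TOXIC_COST, STAR_VALUE_ADDED)
print(f"Avoiding a toxic hire is worth about {ratio:.2f}x hiring a star")
```

The ratio lands a little above two, which is where "roughly twice" comes from.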
The biggest hiring risk is not missing a great candidate. It is letting a destructive one through.
The reason this matters: most interview processes are designed to find talent. Few are designed to detect toxicity, and the two goals require different signals.
Whiteboard interviews test whether a candidate can perform under observation while solving a problem they’d normally Google. A 2020 study by Behroozi et al. found that candidates given traditional whiteboard interviews with an observer performed at half the level of those who solved the same problems privately [3]. All women in the public condition failed. All women in the private condition passed [3].
This is not a talent filter. It is an anxiety filter.
Pair programming interviews are better, but they carry their own distortion. Pair programming was designed as a collaborative practice for producing code, not for evaluating a stranger’s skill under time pressure. When I pair with a colleague on my team, we share context, vocabulary, and trust. An interview has none of those. The candidate is performing while being watched by someone who holds power over their career. Calling that pair programming is like calling a job interview a conversation.
The deeper issue is tacit knowledge. Most of what a skilled engineer knows is not something they can articulate on demand. They recognize patterns. They sense when a design will cause problems in six months. They make tradeoffs that feel obvious to them but are invisible to someone at a different skill level. Standard interviews are built to test explicit knowledge: can you explain this algorithm, can you describe this pattern, can you walk through your reasoning.
The candidates who perform best are those who are good at talking about code, which is a different skill from writing it.
Some companies try to add science to the process with personality assessments. The two most popular choices in tech hiring are the Myers-Briggs Type Indicator and “growth mindset” screening. The research on both is clear.
The MBTI’s own publisher says using it for hiring is unethical [4]. That’s not a critic talking. That’s The Myers-Briggs Company, in writing, through their Senior Director of US Professional Services. The Myers & Briggs Foundation’s ethical guidelines state it directly.
The reason is simple: the test doesn’t measure what it claims to measure. Pittenger (2005) found that 35% of people get a different four-letter type when retaking the test after five weeks [5]. The National Academy of Sciences reviewed 20+ MBTI studies and concluded there is not enough evidence to justify its use [6]. Its predictive validity for job performance is about r = .10–.20, roughly the same as flipping a coin with a slight thumb on the scale.
Growth mindset fares no better. The largest meta-analysis (Sisk et al., 2018, covering 365,915 participants) found the correlation between mindset and achievement is r = .10, explaining about 1% of variance [7]. When Macnamara and Burgoyne (2023) restricted the analysis to the six highest-quality studies, the effect dropped to d = 0.02 [8]. Not small. Negligible. And researchers with financial ties to mindset interventions reported significantly larger effects than independent researchers did [8].
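The step from a correlation to "percent of variance explained" is just squaring r. As a minimal sketch, using the correlations quoted in this section (the function name and labels are illustrative assumptions, not from the cited papers):

```python
def variance_explained(r: float) -> float:
    """Share of variance accounted for by a correlation r, i.e. r squared."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlations as quoted in the text; labels are only for display.
quoted = [
    ("growth mindset vs. achievement", 0.10),
    ("MBTI vs. job performance, upper end", 0.20),
]

for label, r in quoted:
    share = variance_explained(r)
    print(f"{label}: r = {r:.2f} -> {share:.0%} of variance")
```

Even the upper end of the quoted MBTI range leaves the large majority of the variance in performance unexplained.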
There is no peer-reviewed evidence that growth mindset predicts job performance in any workplace setting. Asking about it in interviews measures nothing useful.
Using the Myers-Briggs for hiring is unethical; screening for Carol Dweck's growth mindset measures nothing useful.
What actually works? The Big Five.
The Big Five personality model (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) has over three decades of meta-analytic support. Barrick and Mount (1991) established that conscientiousness predicts job performance across occupations [9]. Wilmot and Ones (2019) confirmed this across 1.1 million participants [10].
For software engineering, the picture has a useful nuance. Conscientiousness is a weaker predictor in high-complexity work [10]. Gnambs (2015) found that openness to experience and conscientiousness both predicted programming aptitude, and that openness has become more important over time as software work requires more creativity [11]. Introversion also correlated with programming skill [11].
There’s an honest caveat here. Even the best personality measures explain 4–6% of performance variance. They’re susceptible to faking. And extremely high conscientiousness can be counterproductive, producing rigid, perfectionistic engineers [12]. The Big Five should be one input in a multi-method process, not a standalone gate.
Seniority levels don’t work. Skill-specific assessment does.
Most companies assign a single seniority level to an engineer: junior, mid, senior, staff. This flattens a complicated reality. A senior engineer might be expert-level at API design and novice-level at frontend performance optimization. Labeling them “senior” tells you nothing about which problems they can solve.
The Dreyfus model of skill acquisition, published in 1980 by Stuart and Hubert Dreyfus, describes five stages: Novice, Advanced Beginner, Competent, Proficient, and Expert [13]. The stages differ not in how much someone knows, but in how they think.
At the Competent stage, typically reached after 2–3 years, practitioners break problems into components, apply rules, and build solutions step by step [13]. They value explicit reasoning. They believe skill is demonstrated by showing your work.
At the Expert stage, practitioners perceive situations as wholes and respond intuitively [13]. As the Dreyfus brothers described it, the expert doesn’t calculate or solve problems in the traditional sense. They draw on vast repertoires of pattern recognition, estimated at 100,000+ distinguishable situations for chess grandmasters. Neuroscience supports this: Amidzic et al. (2001) found experts and amateurs use different brain regions for the same tasks [14].
This creates a specific failure mode in interviews that I’ve seen play out dozens of times. A competent-level interviewer asks “walk me through your reasoning.” A genuine expert gives a sparse answer, not because they lack depth, but because their cognition doesn’t work through explicit rule-following. The interviewer marks this as shallow thinking. It is the opposite.
Andy Hunt captured this in Pragmatic Thinking and Learning: experts can be amazingly intuitive but completely inarticulate about how they arrived at a conclusion [15]. They genuinely don’t know. It just felt right. A competent interviewer hears “it felt right” and writes “no clear reasoning process” in their feedback.
The interview didn’t fail the candidate. The interviewer’s model of what skill looks like failed.
The Dreyfus mismatch is one half of the problem. Ego is the other.
Similarity bias is well-documented in hiring research. Rand and Wexley (1975) showed that biographical similarity between interviewer and candidate produces higher ratings regardless of qualifications [16]. Rivera (2012) found that more than half of hiring professionals at elite firms ranked “cultural fit” as the most important criterion, and that interviewers used themselves as models for the ideal candidate [17].
In technical interviews, this plays out as: “they didn’t solve it the way I would.” The interviewer has a preferred approach. The candidate uses a different one. The candidate’s approach might be better, but the interviewer can’t evaluate what they don’t recognize.
This connects directly to the Dreyfus model. A competent-level interviewer evaluates through rules. When a proficient or expert candidate bypasses those rules with pattern-matched intuition, the interviewer doesn’t see mastery. They see someone who skipped steps. And because the interviewer can’t distinguish “skipped steps due to incompetence” from “skipped steps due to operating at a higher cognitive level,” they default to the interpretation that protects their ego.
The Dunning-Kruger effect makes this worse [18]. Participants in the bottom quartile of skill estimated their performance at the 62nd percentile. The incompetent lack both skill and the ability to recognize their incompetence. An interviewer who is mediocre at system design may genuinely lack the ability to distinguish a good answer from a great one, while being fully confident in their evaluation.
Confirmation bias locks the whole cycle in place. Research shows 60% of interviewers make their decision within 15 minutes [19]. The rest of the interview is theater. Google’s own internal analysis of tens of thousands of interviews found, in Laszlo Bock’s words, “zero relationship” between individual interview scores and job performance [20].
Interview Metrics: Feedback, Accuracy, Time, and Effort
Most discussions about hiring focus on evaluating candidates. I think the more urgent question is: how do you evaluate the interview process? A bad process applied consistently will produce consistently bad results.
I developed the FATE Metrics over 15 years of observing interview processes across companies. FATE stands for Feedback, Accuracy, Time, and Effort. Each metric scores 0–10.
Feedback measures how useful the rejection feedback is to the candidate. A zero is a generic no-reply email. A ten is a personalized message from the hiring manager with specific strengths, areas for improvement, and a request for the candidate’s feedback on the process itself. Most companies score a 1 or 2 here. They send form rejections or ghost the candidate and wonder why their employer brand suffers.
Accuracy measures how closely the interview reflects the actual job. A zero is abstract algorithmic puzzles unrelated to daily work. “Given a bus, how many baseballs would you fit in it?” A ten is a real domain scenario: “We work on internal developer tools, and we’ve been asked to design an API for weather data. Where would you start?” High accuracy means the candidate works on a problem that resembles what they’d actually do in the role, so it will be different for each company. It also means the interviewer doesn’t know the solution in advance, so they can’t test for “my way.”
Time measures duration from first contact to decision. A zero is months of rounds with no communication about next steps. A ten is a streamlined process: one shortlist stage, one or two more stages, decision within a week. Good interviews are shorter and more accurate. Long interviews usually mean the process is compensating for low accuracy with more rounds at the cost of the candidate's patience.
Effort measures how much work the candidate must do. A zero is a multi-day on-site with six rounds, twelve interviewers, and an uncompensated take-home assignment. A ten balances insight with respect for the candidate’s time: a skill self-assessment for shortlisting, a focused technical session, and a behavioral interview.
These metrics don’t make an interview process perfect. They’re instruments, like the gauges in a cockpit. They help you sense where the process is failing. A company scoring 2/8/7/3 knows their technical accuracy is strong but their feedback and effort demand are problems. That’s a specific, actionable finding.
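For teams that want to track these scores over time, one way to record them is a small value type that enforces the 0–10 range and reports the weak metrics. This is a sketch under stated assumptions: the class name, the method name, and the cutoff of 5 for "weak" are all hypothetical choices for illustration, not part of the FATE definition.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class FateScore:
    """One interview process scored on the four FATE metrics, each 0-10."""
    feedback: int
    accuracy: int
    time: int
    effort: int

    def __post_init__(self) -> None:
        # Reject scores outside the 0-10 scale described in the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 10:
                raise ValueError(f"{f.name} must be between 0 and 10, got {value}")

    def weak_spots(self, threshold: int = 5) -> List[str]:
        """Metrics scoring below the threshold (5 is an assumed cutoff)."""
        return [f.name for f in fields(self) if getattr(self, f.name) < threshold]

    def __str__(self) -> str:
        # Render in the F/A/T/E shorthand used in the text.
        return "/".join(str(getattr(self, f.name)) for f in fields(self))


example = FateScore(feedback=2, accuracy=8, time=7, effort=3)
print(example, "->", example.weak_spots())
```

For the 2/8/7/3 example, the weak spots come out as feedback and effort, matching the reading given above. The scores are deliberately kept as four separate numbers here; averaging them into one would hide exactly the pattern the metrics are meant to expose.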
Business scenarios over language trivia
The Accuracy metric deserves its own note because it addresses a confusion I see in almost every technical interview. Companies conflate three different types of knowledge and test for the wrong one.
Programming language knowledge is knowing the syntax and idioms of Python, TypeScript, or Go. Computer science knowledge is knowing Big-O notation, data structures, and algorithm families. Domain knowledge is knowing how to design a payment gateway that fits the specific business, model a subscription billing system, or structure an API for a specific business context.
Most whiteboard interviews test CS knowledge in isolation. But the job requires domain knowledge applied through a programming language. When you ask a candidate to implement a red-black tree on a whiteboard, you are testing for CS knowledge they will look up once every few years, in a format (memorized recall) that has no relationship to how they’ll use it.
Business scenario interviews fix this. Give the candidate a real problem from the team’s domain. Let them ask clarifying questions. Let them design before they code. Evaluate how they think about the problem, not whether they remembered the optimal sort. This tests what you actually need: can this person reason about our specific problems in a way that produces working software?
When the interviewer doesn’t know the answer in advance, similarity bias drops. The interviewer can’t grade the candidate against their own solution because they don’t have one. Both people are reasoning together. That’s closer to what the job looks like.
Just because everyone does it, that doesn't mean it works.
Thanks for reading. If you have some feedback, reach out to me on LinkedIn, Reddit or by replying to this post.
References
1: Boushey, H., & Glynn, S. J. (2012). There are significant business costs to replacing employees. Center for American Progress.
2: Housman, M., & Minor, D. (2015). Toxic workers. Harvard Business School Working Paper 16–057.
3: Behroozi, M., et al. (2020). Does stress impact technical interview performance? ACM ESEC/FSE.
4: The Myers-Briggs Company (2023). Should personality assessments be used in hiring? themyersbriggs.com.
5: Pittenger, D. J. (2005). Cautionary comments regarding the Myers-Briggs Type Indicator. Consulting Psychology Journal, 57(3).
6: Druckman, D., & Bjork, R. A. (1991). In the Mind’s Eye: Enhancing Human Performance. National Academy Press.
7: Sisk, V. F., et al. (2018). To what extent and under which circumstances are growth mind-sets important? Psychological Science in the Public Interest, 19(1).
8: Macnamara, B. N., & Burgoyne, A. P. (2023). Do growth mindset interventions impact students’ academic achievement? Psychological Bulletin, 149(3–4).
9: Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance. Personnel Psychology, 44(1).
10: Wilmot, M. P., & Ones, D. S. (2019). A century of research on conscientiousness at work. Proceedings of the National Academy of Sciences, 116(46).
11: Gnambs, T. (2015). What makes a computer wiz? Linking personality traits and programming aptitude. Journal of Research in Personality, 58.
12: Carter, N. T., et al. (2014). Uncovering curvilinear relationships between conscientiousness and job performance. Journal of Applied Psychology, 99(4).
13: Dreyfus, H. L., & Dreyfus, S. E. (1980). A five-stage model of the mental activities involved in directed skill acquisition. ORC 80–2, UC Berkeley.
14: Amidzic, O., et al. (2001). Pattern of focal γ-bursts in chess players. Nature, 412(6847).
15: Hunt, A. (2008). Pragmatic Thinking and Learning: Refactor Your Wetware. Pragmatic Bookshelf.
16: Rand, T. M., & Wexley, K. N. (1975). Demonstration of the effect, “Similar to Me,” in simulated employment interviews. Psychological Reports, 36(2).
17: Rivera, L. A. (2012). Hiring as cultural matching. American Sociological Review, 77(6).
18: Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it. Journal of Personality and Social Psychology, 77(6). Caveat: Recent research argues that the Dunning-Kruger effect is largely a statistical artifact rather than a true psychological bias.
19: Dougherty, T. W., Turban, D. B., & Callender, J. C. (1994). Confirming first impressions in the employment interview. Journal of Applied Psychology, 79(5).
20: Bock, L. (2015). Work Rules! Insights from Inside Google That Will Transform How You Live and Lead. Twelve Books.