Six-Month Reviews Predict Nothing
Mid-year performance reviews are not bad predictors of year-end performance. They are not predictors at all. The output is paperwork, defensive behavior, and false confidence — produced at a cost most CFOs have never modeled.
In a study of 4,492 managers receiving performance ratings from seven different evaluators each — two bosses, two peers, two subordinates, and themselves — researchers found that 62 percent of the variance in ratings could be attributed to the idiosyncratic perceptions of the rater. Only 21 percent could be attributed to the actual performance of the person being rated. The pattern held across two large datasets, was consistent with more than a century of prior rating research, and has proven robust to nearly every intervention designed to reduce it: calibration sessions, rater training, structured rubrics, multi-rater frameworks. Across the literature, the rater accounts for two to three times more of the rating's variance than the ratee does.
The study, by Scullen, Mount, and Goff, was published in the Journal of Applied Psychology. It is the standard reference in the academic literature on performance ratings, and it is one of the most ignored data points in corporate America. Mid-year performance reviews continue to be conducted in roughly half of U.S. organizations. Managers spend an average of 210 hours per year on formal performance management activities — more than five work weeks of overhead per manager, per year, according to CEB research that has held up for over a decade. Gallup estimates the cost of performance management at large organizations runs as high as 2.4 to 35 million dollars annually in lost working hours. The output of all this effort is a rating that, by the strongest empirical evidence available, measures the rater more than the ratee.
This is the structural problem with mid-year reviews, and it does not get solved by improving the rubric, training the raters, or adding calibration sessions. The problem is not in the implementation. It is in the assumption that a manager's six-month assessment of an employee, expressed as a numerical or categorical rating, contains predictive signal about future performance.
The assumption is approximately false. The reviews predict almost nothing. They generate paperwork, defensive behavior, and false confidence — and the cost of producing them, multiplied across a workforce of any size, runs into the millions of dollars annually for the typical mid-market organization.
What the Research Actually Shows
The empirical case against performance ratings is older, broader, and more consistent than most HR functions acknowledge.
The Scullen-Mount-Goff finding is the most-cited because of its scale and methodological cleanness, but it is one data point in a long sequence. Mount and colleagues, in earlier work, estimated idiosyncratic rater effects at 72 percent of variance.
Across the multi-source feedback literature, average rater variance hovers around 58 percent. Studies dating back to Cooper in 1981 and continuing through Kingstrom and Bass, Woehr and Huffcutt, and others have repeatedly shown that interventions designed to reduce rater bias — calibration meetings, structured rubrics, rater accountability, training programs — produce small effects, if any, on the underlying variance pattern.
Rater bias is not a defect of inadequate training. It is structural to the act of one person assessing another over a six-month period and condensing the assessment into a rating.
The implications for prediction are direct. If 60 percent of a rating's variance reflects who is doing the rating rather than who is being rated, the rating's correlation with future performance is bounded by the share of variance that actually reflects performance.
That share — roughly 20 to 25 percent — sets a ceiling on what the rating can predict, and the ceiling is low. A H1 rating is a noisy signal of H1 performance, which is itself a noisy signal of H2 performance, which means a H1 rating's correlation with H2 ratings or outcomes is, in most empirical work, weak enough that the rating is functionally not predictive at the individual level.
The market data on perceived value tracks the empirical data closely. Gallup and Deloitte research finds that only 2 percent of CHROs believe their performance management system works. Just 32 percent of executives report that their performance approach enables timely, high-quality talent decisions. Sixty-four percent of workers describe formal performance reviews as "a complete waste of time that doesn't help them perform better." A Gallup analysis estimated that approximately one-third of the time, the traditional review process actively makes performance worse, by triggering disengagement and defensive behavior in the employee receiving the rating.
The companies that have publicly acted on this evidence — Adobe in 2012, Deloitte in 2015, General Electric and Accenture in the same period — did so after running internal analyses that confirmed the published research with their own data. Deloitte tallied close to two million hours per year invested in its performance management system before reforming it. Adobe estimated 80,000 manager hours per year before its overhaul. Both companies concluded that the system was producing ratings that primarily measured the raters and paperwork that primarily measured itself. The pattern repeats wherever the analysis is run rigorously: the cost is large, the predictive value is small, the satisfaction is low.
What is striking is not that some organizations have abandoned the practice. It is that most have not. Roughly half of U.S. companies still conduct annual or semi-annual reviews, often without ever running the internal analysis that would surface what the published literature already shows. The persistence of the practice is not evidence of its effectiveness. It is evidence that the cost is distributed (across thousands of manager hours) and the failure is invisible (because nobody measures whether the rating predicted anything).
The Three Functions the Review Fails to Do
Inside the mid-year review are three distinct functions that the format collapses into a single output. Understanding why the review fails requires separating them.
“Across the literature, the rater accounts for two to three times more of the rating’s variance than the ratee does — which means the rating is, structurally, two-thirds noise.”
Documentation
The first function is documentation. HR and legal teams need a record of performance issues, particularly for managing underperformance and defending against subsequent termination disputes. Documentation is real and necessary.
The mid-year review is a poor format for it because the documentation is periodic rather than event-driven: a problem that surfaces in February gets recorded in July, by which time the contemporaneous detail has degraded. Worse, the periodic format incentivizes manager smoothing — a single rating that summarizes six months tends to either round up (avoiding confrontation) or round down (managing toward calibration), neither of which preserves the discrete events the documentation function actually needs.
Feedback
The second function is feedback. Employees benefit from understanding how their manager perceives their work, what is going well, and where adjustment is needed. Feedback is real and necessary. The mid-year review is a poor format for it because the feedback is delivered at a delay of three to six months from the events it concerns.
By the time an employee learns in July about a behavioral pattern from March, the pattern has either continued (in which case the feedback is late by months) or resolved (in which case the feedback is irrelevant). Effective feedback is contemporaneous. The semi-annual format guarantees that effective feedback is structurally impossible.
Forecasting
The third function is forecasting. Leadership and HR need to identify high performers for advancement, low performers for management, and the broad middle for development investment. Forecasting is real and necessary.
The mid-year review is a poor format for it because the rating, dominated as it is by idiosyncratic rater effects, does not actually forecast individual performance. Companies use it as if it does. The result is promotion, comp, and development decisions made on a signal that is, structurally, two-thirds noise.
A CHRO at a mid-market industrial manufacturer described the realization:
"We ran the analysis on three years of mid-year ratings. We compared them against year-end ratings, against subsequent promotion decisions, against actual departure data, against everything we could connect them to. The correlations were everywhere from weak to nonexistent. The ratings were not predicting which of our managers would get promoted, who would leave, who would underperform, or who would surface as a problem the following year. We were spending close to 4,000 manager hours per cycle to produce a number that did not predict anything we needed to predict. The hardest part was admitting we had been doing it for fifteen years."
The bundling of the three functions is what makes the review costly without being useful. Each function has its own right answer. Documentation is best handled through event-based logging maintained continuously. Feedback is best delivered through structured one-on-ones at weekly or biweekly cadence. Forecasting, when done well, draws on leading indicators that have nothing to do with manager ratings. The mid-year review attempts all three simultaneously and produces an inadequate output for each, at a cost that, when honestly modeled, is consistently larger than any of the three functions would cost if pursued independently.
Call the cumulative cost of producing the bundled output the review tax: the manager hours, HR overhead, calibration meeting time, employee preparation time, software licenses, and downstream cost of decisions made on noisy signal. The tax is paid every cycle. It is rarely modeled honestly because the cost is distributed across departments and the failure mode is invisible to the people who would notice it.
Unbundling the Three Functions
The corrective is to do each of the three functions separately, with a format suited to each, and to stop using the mid-year review as a single instrument that cannot serve any of them well.
For documentation, the corrective is event-based logging. Managers maintain a continuous record of specific events — a notable success, a missed deliverable, a behavioral concern, a customer complaint, a peer commendation — entered as they occur. The record is a running file, not a periodic narrative. The HR system stores it; the manager and employee both have visibility into it; and the cumulative file becomes the documentation HR and legal need.
The file does not require a semi-annual rating to be defensible. It requires only that events be recorded with date, context, and specificity. Companies that have made this transition find that the documentation actually improves: it is more contemporaneous, more specific, and less subject to the manager-smoothing distortion that semi-annual ratings produce.
For feedback, the corrective is a structured one-on-one cadence. Weekly or biweekly, manager and employee meet for thirty to forty-five minutes with a defined agenda — current work, blockers, near-term priorities, growth conversations. The format does not require ratings, scoring, or written deliverables. It requires consistency and substantive content.
The empirical work on continuous feedback consistently shows higher engagement, faster correction of problems, and better employee perception of fairness compared to periodic review formats. Manager hours required are roughly comparable to the review-cycle hours when the alternative is properly accounted for, but the hours are distributed continuously rather than concentrated in two stressful windows per year.
For forecasting, the corrective is leading-indicator measurement that does not run through manager ratings. The signal stack approach — drawing on output quality, peer signals, behavioral consistency, engagement patterns, and structured calibration of specific outcomes — produces a forecast that is directly grounded in observable data rather than holistic ratings.
This is a meaningfully different practice than what most companies currently do. It requires investment in measurement infrastructure. It produces forecasting accuracy that ratings cannot match, because it is grounded in data that is not 60-percent rater variance.
This is the three-stream model: documentation, feedback, and forecasting handled as separate workflows with formats appropriate to each. None of the three is a "review" in the traditional sense. The review, as currently practiced, is the bundling problem. Removing the bundle is the corrective. The three-stream model produces better documentation, better feedback, better forecasting, and a smaller review tax than the practice it replaces.
Internal links between this approach and the signal stack framework apply directly: the leading indicators that predict departure also predict performance trajectory, and the same data infrastructure that supports retention monitoring at Margin supports performance forecasting at much higher accuracy than ratings deliver.
What the Replacement Requires
Adopting the three-stream model requires four changes most companies have not made.
The first is HR's willingness to give up the form-driven cycle. The mid-year review, as a process, is deeply embedded in HR operations: cycle planning, training materials, calibration meetings, performance management software, comp and promotion linkage, executive-level talent reviews. Dismantling it produces dislocation in every one of these places. The dislocation is real and survivable; the companies that have done it report a one-time period of organizational discomfort followed by sustained improvement. But the discomfort is precisely why most HR functions have not led the change. The CFO and CEO often have to push it.
The second is manager training in continuous feedback. Most managers are not skilled at substantive weekly one-on-ones. They default to status updates, project reviews, and tactical problem-solving — none of which constitutes feedback. Building the capability requires training, scripted prompts, and HR partnership during the first several months of the new cadence. Without it, the one-on-ones become administrative and the feedback function fails again, this time without the fallback of the periodic review.
The third is decoupling comp and promotion decisions from the rating cycle. In most companies, the mid-year and year-end reviews directly drive comp adjustments and promotion considerations. Removing the ratings forces a different basis for those decisions — typically a combination of leading indicators, calibrated specific outcomes, and structured peer input. The CFO and CHRO have to design the replacement decision architecture before dismantling the old one. Companies that abandon the review cycle without designing the replacement find themselves making comp and promotion decisions ad hoc, which is not an improvement.
The fourth is legal and compliance comfort with event-based documentation. HR and legal teams typically prefer the periodic review because it produces a familiar artifact: a rated, signed document on file. Event-based logs, properly maintained, are equally defensible — often more so, because the contemporaneous specificity is harder to challenge than a smoothed semi-annual narrative. But the legal team has to be brought along, and the documentation system has to be designed to produce what the legal team needs in a termination dispute. This is a tractable problem, but it requires direct engagement, not a memo.
These four changes do not save the review. They replace it with a different system that does the work the review was supposed to do. The new system is lighter on manager hours, heavier on capability investment, and structurally aligned with what the empirical evidence has been saying for twenty-five years. The companies that make the change recover the manager hours and spend them on the function the review was meant to support but consistently undermines: actual management.
The Choice the Evidence Forces
The companies that look at the empirical evidence and act on it will redirect manager hours from paperwork to actual management. The companies that defend the cycle will keep producing ratings that primarily measure their managers, paying review tax in the millions, and treating the output as if it were predictive of anything beyond the next rating.

