This is part of a series on shortening PR code review time. In the first post, I recommended gathering metrics (see If code reviews take too long, do this first). If you base a goal on one of these metrics, it's possible that the metric will be gamed. This is not because of malice; it's a well-known phenomenon (Goodhart's Law), which simply recognizes that it's human behavior to make a metric improve if leadership says that it has to.
So, if you set a goal to lower the total time from PR creation to merge, then that time will probably shorten. You hope this happens the right way: by reducing bad comments and pre-checking PRs. But it's inevitable that some of the improvement in this metric will come from behavior you don't want. For example, some bugs that could have been caught in review will make it to QA and production. That is the tradeoff of faster reviews; slow reviews are expensive too, and they are only worth it for expensive defects. Even if we agree that it's cheaper to find a problem in review than in QA or production, there is still a cost below which we're fine with it being found later (or never). But if expensive issues are being found in production, it's worth checking whether that's because reviews are not thorough enough.
One way to make metrics less likely to be gamed is to pair them with a compensating metric that measures the effect of the gaming. So, in a call center, you might measure the time spent on a call, which you would want to decrease. To make sure that customer problems are still being solved, you might pair that with customer satisfaction scores or the number of times a customer has to call back. You want the pair of metrics you choose to both improve, not for one to improve at the expense of the other.
This is why I recommend pairing test code coverage with another metric that makes sure the most important or most risky code is covered. For code review time, you might pair it with the DORA metric of "Change Fail Percentage" (the percent of deploys with a bug requiring a hotfix or rollback). The pairing of metrics should imply a set of behaviors that can improve them both.
You could also go a step further and track the behaviors (or lead indicators), not just the lagging indicators. A good lead indicator is a behavior, not an outcome, so it's harder to game. For example, to make it more likely that a PR has low idle time and is easier to review, it should be small. We don't want to get into line counts, so a high-level indicator of "excellent", "acceptable", or "unacceptable" PR size, based on a fuzzy combination of the number of commits, the changeset size, and the maximum commit size, could be enough. Just giving this indicator back to the author before they open the PR (with a local script like the sketch below) would encourage them to break down their PRs. If leads see undesirable behavior sneaking through, they could augment the script or override its assessment and reject the PR anyway.
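To make that concrete, here is a minimal sketch of what such a local script might look like. It assumes a Python script run from the branch being prepared, a base branch named main, and made-up thresholds; all of those are illustrative assumptions you would tune for your team, not recommended limits.

```python
#!/usr/bin/env python3
"""Minimal sketch of a local pre-PR size check.

The base branch name ("main") and the thresholds below are
illustrative assumptions; tune them to your team's norms.
"""
import subprocess

BASE = "main"  # assumed base branch


def git_lines(*args: str) -> list[str]:
    """Run a git command and return its output as a list of lines."""
    out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return out.stdout.splitlines()


def changed_lines(rev_range: str) -> int:
    """Total added + deleted lines in a revision range."""
    total = 0
    for line in git_lines("diff", "--numstat", rev_range):
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files report "-"
            total += int(added) + int(deleted)
    return total


def assess() -> str:
    """Classify the current branch's size relative to BASE."""
    commits = git_lines("rev-list", f"{BASE}..HEAD")
    num_commits = len(commits)
    total_changed = changed_lines(f"{BASE}...HEAD")
    # Assumes each commit has a parent (true for a normal feature branch).
    max_commit = max((changed_lines(f"{c}~1..{c}") for c in commits), default=0)

    # Fuzzy, made-up thresholds -- an illustration, not a recommendation.
    if num_commits <= 5 and total_changed <= 200 and max_commit <= 100:
        return "excellent"
    if num_commits <= 10 and total_changed <= 500 and max_commit <= 300:
        return "acceptable"
    return "unacceptable"


if __name__ == "__main__":
    print(f"PR size: {assess()}")
```

Running something like this before opening the PR, or wiring it into a pre-push hook, gives the author the signal while it is still cheap to split the change.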
Paired, complementary metrics are harder to game because the combination limits behaviors to those that can improve both metrics. A leading indicator is harder to game because it’s happening in real-time and you can react to it. So use these techniques to make sure your metrics are showing improvement in the underlying thing you care about.