How to make it harder to game PR metrics

This is part of a series on shortening PR code review time. In the first post, I recommended gathering metrics (see If code reviews take too long, do this first). If you base a goal on one of these metrics, that metric will probably be gamed. This is not because of malice; it’s a well-known phenomenon (Goodhart’s Law), which recognizes that it’s human behavior to make a metric improve when leadership says it has to.

So, if you set a goal to lower the total time from PR creation to merge, then that time will probably shorten. You hope the team does this the right way, by reducing bad comments and pre-checking PRs. But it’s inevitable that some of the improvement in this metric will come from behavior that you don’t want. For example, some bugs that could have been caught in review will make it to QA and production. This is the tradeoff if you want faster reviews: slow reviews are expensive too, and they are only worth it for expensive defects. Even if we agree that it’s cheaper to find problems in review than in QA or production, there is some cost below which we’re fine with a problem being found later (or never). But if expensive issues are being found in production, it’s worth checking whether that’s because reviews are not thorough enough.

One way to make metrics less likely to be gamed is to pair them with a compensating metric that measures the effect of the gaming. So, in a call center, you might measure the time spent on a call, which you would want to decrease. To make sure that customer problems are still being solved, you might pair that with customer satisfaction scores or the number of times a customer has to call back. You want the pair of metrics you choose to both improve, not for one to improve at the expense of the other.

This is why I recommend pairing testing code coverage with some other metric that makes sure you get coverage of the most important or most risky code. For code review time, you might pair that with the DORA metric of “Change Fail Percentage” (percent of deploys with a bug requiring a hotfix or rollback). The pairing of metrics should imply a set of behaviors that can improve them both.
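To make the pairing concrete, here’s a minimal sketch (in Python) of reporting both metrics side by side. The record fields (created_at, merged_at, needed_hotfix) are hypothetical stand-ins for whatever your PR and deploy tooling actually exports.

```python
# Sketch: report a paired dashboard of review speed and DORA Change Fail
# Percentage. Field names here are hypothetical; adapt to your tooling.
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class PullRequest:
    created_at: datetime
    merged_at: datetime

@dataclass
class Deploy:
    needed_hotfix: bool  # True if the deploy required a hotfix or rollback

def median_review_hours(prs: list[PullRequest]) -> float:
    """Median hours from PR creation to merge."""
    return median((pr.merged_at - pr.created_at).total_seconds() / 3600
                  for pr in prs)

def change_fail_pct(deploys: list[Deploy]) -> float:
    """DORA Change Fail Percentage: failed deploys / total deploys."""
    return 100 * sum(d.needed_hotfix for d in deploys) / len(deploys)

def report(prs: list[PullRequest], deploys: list[Deploy]) -> None:
    # You want the first number to fall while the second stays flat or improves.
    print(f"Median review time: {median_review_hours(prs):.1f}h")
    print(f"Change fail percentage: {change_fail_pct(deploys):.1f}%")
```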

You could also go a step further and track the behaviors (or lead indicators) and not just the lagging indicators. A good lead indicator is a behavior, not an outcome, so it’s harder to game. For example, to make it more likely that a PR has a low idle time and is easier to review, it should be small. We don’t want to get into line counts, so a high-level indicator of “excellent”, “acceptable”, or “unacceptable” PR size, based on a fuzzy combination of the number of commits, changeset size, and maximum commit size, could be enough. Just giving this indicator back to the author before they PR (with a local script like the sketch below) would encourage them to break down their PRs. If leads see undesirable behavior sneaking through, they could augment the script or override its assessment and reject the PR anyway.
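Here’s a minimal sketch of what that local script could look like, assuming a git repository and a main base branch. The thresholds are made up for illustration; tune them to your team’s norms.

```python
# Sketch: rate the current branch's size before opening a PR.
# Thresholds are hypothetical; adjust them to your team's norms.
import subprocess

def changed_lines(rev_range: str) -> int:
    """Insertions plus deletions that git reports for a revision range."""
    out = subprocess.run(
        ["git", "diff", "--shortstat", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    # Output looks like: " 3 files changed, 40 insertions(+), 7 deletions(-)"
    nums = [int(tok) for tok in out.replace(",", " ").split() if tok.isdigit()]
    return sum(nums[1:]) if len(nums) > 1 else 0

def rate_branch(base: str = "main") -> str:
    commits = subprocess.run(
        ["git", "rev-list", f"{base}..HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    total = changed_lines(f"{base}...HEAD")
    # Largest single commit (assumes no root commits on this branch)
    biggest = max((changed_lines(f"{c}~1..{c}") for c in commits), default=0)
    # Fuzzy combination: small overall, few commits, no giant commit
    if total <= 200 and len(commits) <= 5 and biggest <= 100:
        return "excellent"
    if total <= 600 and len(commits) <= 15 and biggest <= 300:
        return "acceptable"
    return "unacceptable"

if __name__ == "__main__":
    print(f"PR size: {rate_branch()}")
```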

Paired, complementary metrics are harder to game because the combination limits behaviors to those that can improve both metrics. A leading indicator is harder to game because it’s happening in real-time and you can react to it. So use these techniques to make sure your metrics are showing improvement in the underlying thing you care about.

How long PR code reviews lower developer productivity

I’ve been writing about code reviews for the past week because I think that shortening review cycles is one of the easiest ways to increase developer productivity. But I’ve taken that as a given so far because I’ve seen it on my own teams. A few years ago, I learned that the relationship between productivity and things like code review time is supported by research, and I continue to see it in my consulting practice.

At Trello, I worked on the iOS app. We were a team of ten engineers who worked closely with the ten engineers building the Android app. We shared managers, PMs, designers, and roadmaps, and we thought of ourselves as the mobile team. We had shared processes within mobile, but since we were somewhat insulated from the other teams, with less intermingling, our processes differed from the rest of Trello’s.

At some point, a (non-mobile) engineer decided to gather metrics across all teams to see what he could learn about code reviews (see If code reviews take too long, do this first). He plotted various statistics and presented them for discussion. In slide after slide, the two mobile teams were outliers on nearly every statistic. The four metrics that stood out were the size of the PRs, the number of PRs, PR idle time, and the total time to approval. Both mobile teams had the smallest PRs and the most of them (normalized for team size). We also had the shortest times, which I think is a consequence of the first two metrics.

I can tell you from being on the mobile team at that time that this was intentional and built into the mobile team culture. I was trained on our PR culture on my first day, when the entire mobile team was about five engineers, and we kept passing that training on to new team members (see Construct PRs to Make Reviewing Easy). The short feedback loops were enabled by our careful PR construction and pre-checks, which I documented in PR Authors Have a lot of Control on PR Idle Time. When we dug deeper into the code review data, we found that the most senior iOS engineer and the most senior Android engineer were modeling that behavior (short and frequent PRs), which set a tone for the teams.

The shorter loops also contributed to my personal happiness at work and feeling of productivity. It was only after I left the team that I saw DevEx’s identification of feedback loops as a key driver of productivity. In the three years since I left, I’ve been seeing code review feedback loops as the easiest loop to shorten. It also has a large and immediate impact because it affects every engineer nearly every day.

According to DevEx, short feedback loops help by reducing context switching and enabling rapid course correction. I have seen that they also align estimated time with calendar time, which is the main source of the discrepancy in the feeling of productivity between engineering and everyone else (see Confession: I Estimate Using Time).

Some of the effect of shortening loops improves productivity without requiring us to “work faster”, but the effect on context switching is different: reducing context switching does make us work faster. If we have to wait days for a PR to be approved, it takes longer to get started on the next step. We might have had to work on something unrelated in the meantime. Switching back to this PR (which might require responding to reviews) causes a delay that grows with the waiting time. With shorter code review loops, we probably still have enough of the context in our heads and can respond faster.

So, reducing both the wait time for a code review to start (PR idle time) and the total time it takes to approve a PR cuts calendar time and context switches, which in turn reduces the time it takes to respond to a review and to move on when the PR is part of a larger project.

If you still have nitpick PR comments

This is part of a series on reducing PR code review time, which is important because long feedback loops are one of the drivers of low developer productivity. Here are the other articles in the series with a TL;DR to remind you of the key takeaway; the details of how to do it are in each article.

  1. If code reviews take too long, do this first (TL;DR: gather data)
  2. What to do about nitpick comments in PRs (TL;DR: eliminate them)
  3. What to do about excessive (good) PR comments (TL;DR: make better PRs)
  4. Not all PR nitpick comments are bad (TL;DR: sometimes you need or want them)

If you are still getting nitpick comments after you have done what you can to eliminate them, try adding a checklist for comments. Here’s an example checklist to start from:

  1. Is it a bug?
  2. Does the code violate the specification?
  3. Does the code violate our design system?
  4. Is the code reimplementing something where it should reuse our library code?
  5. Does the code not meet our style guide?
  6. Does the code not meet our testing code coverage requirement?
  7. Is the code not consistent with the vast majority of our codebase (especially recently approved code)?
  8. Does the suggestion have an obvious, objective way to resolve it?

If a suggestion doesn’t meet your criteria for a good comment, then try one of these techniques to improve it:

  1. Reframe the nitpick suggestion as an (objective) defect with a clear way to resolve it. For example, if you “don’t like” a variable name, give a reason why you don’t like it. Some examples that are actually defects: (a) the variable is two letters and you have a standard that requires four or more, (b) the variable doesn’t match your naming convention (for example, you always use the word “count” instead of “length”), (c) you don’t understand the variable name, in which case suggest better names and express how the name is confusing you, which is more objective than “not liking a name”.
  2. Reframe the suggestion as the beginning of a process to update coding standards. When new, non-defect, but possibly good comments are warranted, they should start as suggestions that might change the coding standard. In this case, move the discussion to whatever process you use to change standards (e.g., a retro).
  3. Make the comment anyway if the situation warrants it: see Not all PR nitpick comments are bad.

If the suggestion is still a nitpick that you can’t improve, and it isn’t warranted anyway, don’t make it.

Not all PR nitpick comments are bad

This article is part of a series on reducing code review time. It helps if you have read these two first.

  1. If code reviews take too long, do this first (gather data)
  2. What to do about nitpick comments in PRs (eliminate them)

In the second article above, I define a nitpick comment as one that is just an opinion and not a defect, where a defect is something that doesn’t meet some objective criteria. A bug is an obvious example but so is a style-guide violation. A nitpick is a comment where the author and reviewer might disagree, but the issue isn’t objective, and there isn’t an obvious way to resolve the problem.

I recommend that you have almost none of these kinds of comments in a review, but there are a few cases where it’s ok.

  1. During onboarding: You should set the expectation with your new hires that their first month or so of PRs might get excessive commenting as a way of training them on the team’s culture. To make this less onerous, early PRs should be very small. The more junior the new hire, the longer this might last.
  2. If requested: Sometimes authors really do want opinions, for example on code that isn’t ready for merge but could use more eyes. In that case, authors should tag their PR as something they would like more help on, and then any suggestions are fine. Authors who are doing new things in the codebase (and possibly setting new standards) should proactively request (nitpick) opinions.
  3. As a precursor to setting a standard: If you have a strong opinion that you think should become a team standard, then starting the conversation in a PR comment might be warranted. I recommend that you quickly move it to whatever process your team uses to change its standards (e.g., a retro). Use the PR as an example.
  4. If rare: If you think it won’t derail the PR and that the author would welcome it, then go ahead. A good example is pointing to an interesting technique that the author might not be aware of, but you think applies. This is more of an FYI than a request to change something.

What to do about excessive (good) PR comments

This post is part of a series on reducing the time it takes for a PR to go through code review. Here are the first two articles:

  1. If code reviews take too long, do this first
  2. What to do about nitpick comments in PRs

In step 1, we gathered data, and in step 2, we took steps to eliminate nitpick comments. But, we could still have slow code reviews because there is a lot of discussion about the pull request that needs to happen before the code is merged.

Some of this is fine. There will always be outlier pull requests that merit careful attention. However, if this is the norm, then it’s worth taking steps to reduce the code review feedback loop because (according to DevEx) long feedback loops are one of the drivers of low developer productivity.

When you have a lot of good commenting on a PR, it might be a sign that the author isn’t checking their own code enough before they create the pull request. Spending an hour or so making sure that a pull request doesn’t have obvious problems saves the reviewer’s time, but it also shortens the feedback loop, which could otherwise stretch over several days.

On my team at Trello, almost all PRs were approved within a day because we had a culture of making sure that PRs were easy to review. I have documented some of our practices here:

  1. Construct PRs to Make Reviewing Easy
  2. PR Authors Have a lot of Control on PR Idle Time
  3. Pull Requests for One

To construct a PR that is easy to review, the code must be correct, and it must also be easy to see that it’s correct. Here are some ways to reduce the time it takes for reviewers to approve PRs:

  1. Use automatic linters and code formatters so that it’s impossible to PR code that doesn’t meet your coding standards. The reviewer doesn’t need to check for style problems because there can’t be any.
  2. Go beyond simple automations. Automated review tools exist for accessibility, security, code coverage, static analysis, etc. These should establish a baseline to assist the reviewer, not replace them. Reviewers should not look at PRs until they pass the automated checks.
  3. If you can’t automate something, create a short checklist for the author to go through before posting the PR (see the sketch after this list).
  4. If it’s clear that a PR has lots of problems, a lead should have a 1:1 with the author to train them on self-checking and pair program a better PR with them. This avoids back-and-forth that might take days. The goal should be to get the PR into a state that the lead would approve.
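As a sketch of points 1 through 3, here’s what a minimal author-side pre-PR check could look like. The specific tools (black, ruff, pytest) are placeholders; substitute whatever formatter, linter, and test runner your team actually uses.

```python
# Sketch: run the automated author-side checks before opening a PR.
# The tools listed are examples; swap in your team's own toolchain.
import subprocess
import sys

CHECKS = [
    (["black", "--check", "."], "code formatting"),
    (["ruff", "check", "."], "lint rules"),
    (["pytest", "--quiet"], "tests"),
]

def main() -> int:
    failed = []
    for cmd, label in CHECKS:
        print(f"Checking {label}...")
        if subprocess.run(cmd).returncode != 0:
            failed.append(label)
    if failed:
        print(f"Fix before opening the PR: {', '.join(failed)}")
        return 1
    print("Automated checks passed. Go through the manual checklist, then open the PR.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```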

When PRs are slowed down by excessive back-and-forth between the author and reviewer, but all of that discussion was necessary, that’s an indication that the author isn’t doing enough work to make the code obviously correct. Don’t discourage the commenting or shortcut the review in this case.

What to do about nitpick comments in PRs

In “If code reviews take too long, do this first”, I said that the first step was to gather some data on your code reviews to identify pull requests that take longer than your goal time. One common reason I have seen for code reviews taking too long is excessive small, unimportant suggestions.

Not all small and unimportant suggestions are bad. The problem happens when a suggestion is just an opinion and the code isn’t objectively wrong. So, for example, a typo in a code comment is worth pointing out because the author can fix it without further questions, and the comment is improved. Another example is a clear violation of the style guide, like using a variable with underscore separators when your standard is camel case. In these cases, there is no argument that the original code is correct, and it’s trivial to fix and recheck.

But comments like “I don’t like this variable name”, “you could use map() instead of iterating the array”, or “this function is too long” are a matter of opinion in most cases. It’s not as clear what to do about them, and they might end up causing more conversation. The author’s second try might also not satisfy the reviewer. If the code were clearly wrong, that would be fine, but it’s not worth the extended time to settle a difference of opinion.

To address the problem of excessive nitpick suggestions, the team should adopt a standard that a PR comment should be pointing out a defect. Here is a short list of some obvious defects:

  1. An off-by-one bug in a loop
  2. The screen doesn’t match the specification
  3. An error case is not checked
  4. Typos in variables, function names, or comments
  5. Using tabs when team standard says to use spaces (or vice versa)
  6. Putting business logic in a view when the team requires an MVVM pattern
  7. The code reimplements a function when it should use the one in our shared library

This is not an exhaustive list, but the pattern should be clear. A comment is pointing out a defect when it can compare the code to some agreed upon requirement or standard. It’s also clear what to do about it, and that it must be done.

So, if you find in your data that you have a lot of nitpick comments, gather them all together and go through each one and categorize them:

  1. OK: If the comment is pointing out a clear defect against an agreed upon requirement, then that’s a good comment, and the problem is not the comment.
  2. To document: If the comment is not a defect, but you think the comment is good, and there is a pattern of this kind of comment, then maybe this belongs in your coding standard guide such that it’s clear that the code is defective and what exactly to do about it. After this is documented, future comments of this nature are OK.
  3. To eliminate: If the comment is not a defect and you don’t even think it should be a standard, then this comment should never have been made.

After doing this step, most future comments should be OK, and sometimes there will be a comment where it’s unclear whether it should become a new standard or be rejected as unacceptable. Things like that can be hashed out using whatever process you have.

It could still be the case that there are too many comments and that this is now the main reason that code reviews are taking too long. In the next post, I will explain what to do when you have too many good comments on your code reviews.

Seven Years of Slow and Steady Progress Towards 30×500

I just looked it up—I enrolled in Amy Hoy and Alex Hillman’s 30×500 course in 2017. The course teaches their technique for starting a small subscription-based business that gets $30/month from 500 customers. 2017 was also the year that Atlassian acquired (my employer) Trello. Since I was locked into a 4-year vesting period, and I wanted to work at Atlassian for at least that long anyway, I was never going to start a business right away. But I thought it would be a good idea to lay some groundwork so I could be ready whenever I decided to leave Atlassian.

The first step in 30×500 is to pick an audience—ideally one that you are a part of or that hires you. I picked iOS developers. So I started adding educational content to App-o-Mat, a site I started a few years earlier to build on my book, Hello! iOS Development, but that I had neglected. With this newfound focus, the posts I wrote in 2017 helped me become a more frequent contributor (and editor) of the mobile section for Smashing Magazine. The money I made from Smashing paid for the 30×500 course several times over, but it was hard to keep writing and also excel at my day job, and so I let the writing wane over the next few years.

In 2021, after I resigned from Atlassian to go independent, I decided to get more serious about applying 30×500 techniques. I concentrated on WatchKit and Workout tutorials because I could repurpose what I had learned writing Sprint-o-Mat. But my consulting business became more software business coaching and less programming, and I realized I wasn’t really in the iOS developer audience any more. So, I started searching for a new one.

I got serious about writing educational content along a lot of different dimensions: software (of course, but more like what I was consulting about), but also job searching, personal finance, and writing. Over the next couple of years, I wrote hundreds of posts. Looking them over at the end of 2023, I realized that I had a lot to say about technical debt.

At the beginning of 2024, my goal was to write a small (50-page/10k-word) e-book—more of a pamphlet—on technical debt. This led to a guest article on The Pragmatic Engineer (so I was continuing to get paid to write). Mostly, I treated this as a signal that the content was useful. But the article helped me build an email list and led to podcast invites and webinars, and the content grew based on what I was being asked about. Last week, I sent a document with 40k words to my editor—my hope is that it will be ready to publish by Q1 next year.

So, here I sit, seven plus years after I took the 30×500 course—I still don’t have even 1 of the 500 paying me $30/month. However, the course long ago paid for itself in direct dollars for my writing and easily 100x in terms of the consulting work the writing helped me land and service. The secret sauce of their method leads to finding an endless supply of topics to write about. Honestly, I barely do it right, but just going in the right direction has been good enough.

It helped me develop enough content for a book and the seeds for a few more. I do think there’s a subscription business here. Or maybe a YouTube channel? Or maybe just what it is so far: a blog, a podcast, a book, a source of leads, and proof to myself that I’m a writer.

Book Title as Visual Metaphor

I’m reading a book called Nonfiction Alchemy [affiliate link] by Jordan Ring that has some advice on finding a voice. A lot of it resonates with what I wrote in Writing as Yourself to Yourself. For example, in a section called Write with Heart, he advises you to “write like you speak” to find your voice and to tell stories because “they bought your book, not someone else’s”. But the part where he talked about his title and his voice was something I had not seen before:

I debated writing a more streamlined “business type” book with an unoriginal title like “Write Your Damn Book!” I thought about using straightforward chapter titles and letting less of my personality shine through. I summarily shut down this line of thinking.

He goes on to talk about how the word “Alchemy” is evocative and that he is using it to find other (more original) words to describe his ideas. He talks about spells and potions and elixirs, and it just makes the prose more enjoyable, original, and memorable.

It reminds me of an exercise from Writing Down the Bones by Natalie Goldberg [affiliate link]. I talked about it in my podcast episode, Finding Nouns and Verbs. Here’s how I described it:

Take out a piece of paper. Divide it into three columns. In the first column, write down any ten nouns. Then, fold it back over so you can’t see those nouns, and look at the second column.

At the top of the second column, write down an occupation. She gives examples like chef, pilot, or doctor. Then, for the occupation that you chose, list all the verbs related to it. In the book, Goldberg picked chef and then wrote down cut, slice, fry, marinate, bake, boil, etc. Write down as many as you can think of. You don’t need to limit yourself to ten; in fact, try to come up with a lot more.

Finally, open up the paper. Now you can see both columns. A column of nouns and a column of verbs. Pick a random noun and then find a verb to go with it and complete the sentence.

This will get you thinking of and using concrete and specific nouns and verbs that exert themselves to describe your scene. You won’t need to rely on adjectives and adverbs as much, and when you do, they’ll have more impact.

Thinking along these lines, my title is now Swimming in Tech Debt, and the central metaphor is a swimmer trying to go upstream and being thwarted by debt. In my book, I show how to use the existence of tech debt as a way to propel yourself. Paying the debt gives you energy and joy that puts you in the flow and makes you go faster. Even the word “flow” is serendipitously related to swimming and probably belongs in the subtitle.

Using the Goldberg technique, I could make a list of verbs that describe what a swimmer does: stroke, push off, flip, wade, coast, kick, etc. This is, of course, a good use of ChatGPT, which is a kind of super-thesaurus. It says: swim, wade, stroke, paddle, dive, float, glide, submerge, surface, tread, splash, flip, scull, plunge, sink, kick, sprint, roll, breathe, and relax.

The idea is to use those words in the text sometimes instead of the obvious one. Enough that it brings some life to the text, but not so much that it seems like a gimmick.

Titling a book

The working title of my book has been Pay Tech Debt to Go Fast Now. I chose this because it’s the short answer about what I think you should do if you have a lot of tech debt, which is to concentrate your payment efforts on short-term developer productivity. The book is the long answer with lots of recommendations of how to do it.

But I haven’t been satisfied with it as a title. There’s a passage in my book that seems to have resonated with early readers, and I’m using that as a signal that it could be the source of a good title:

It’s been hard for me to talk about technical debt outside of engineering. The problems we tackle only exist inside the codebase, which is invisible to stakeholders, but it’s the water we swim in.

I don’t know how to explain that to others that don’t live in water. To us, working in a codebase with a lot of debt is like swimming upstream. It resists us moving in the direction we want to go. We eventually get there, but everyone else just sees the result and doesn’t feel the resistance. If we are slowed down, it just looks like we’re slow swimmers.

In Write Useful Books, Rob Fitzpatrick recommends making a promise to the reader and putting it on the cover of the book. He says that it could be the title or subtitle. I overindexed on how good “Write Useful Books” is as a title, which is how I ended up with mine. But I think this approach breaks down when the promise can’t be short. You need something easy for someone to remember and recommend. Pithiness is important.

I am also reading Nonfiction Alchemy [affiliate link], and in it Jordan Ring talks about having a central metaphor to use as a source of vocabulary in the book. “Alchemy” is his example, and the chapter titles are drawn from the same metaphor. I’ve been thinking about that idea too.

Yesterday, I was in Barnes & Noble, staring at the non-fiction bookshelves with my wife and telling her about this problem. We kept throwing out terms, and while we were there, she came up with “Escaping the Tech Debt Trap”. I like “trap”, and one of my favorite books is The Pleasure Trap [affiliate link], so I was drawn to it. I tried to think visually about the idea of a trap.

Then, I remembered that “swimming” passage in my book and thought about swimming in a river and going upstream because I needed to get somewhere, and how I could use the obstacles as a handhold, and push off them to propel myself. The visual helped me realize the physicality of my recommendations. Then, I thought of the title Swimming in Tech Debt, and I liked that it had a double meaning. There’s a play on words that sounds like it could mean something similar to “drowning in tech debt” (being overwhelmed), but what I mean is how to get through it, how to swim through it.

So, anyway, that’s the new title.

If code reviews take too long, do this first

Short feedback loops are one of the drivers of productivity according to the DevEx model. On my team at Trello, we had a goal of all reviews being done inside 24 hours. Having that goal drove behaviors that made most reviews complete in a few hours. So, to start, collect data and get on the same page.

If your reviews are taking too long, try these enabling steps first:

  1. Gather metrics: If you use GitHub, try this repository metrics script to get a baseline (or roll your own; see the sketch after this list).
  2. Get consensus: Nothing will happen unless the whole team is on board with this being a problem and that it can be fixed.
  3. Set a goal: I know from experience that 100% of reviews in less than 24 (work) hours is possible. If that seems out of reach, set something that you could accomplish in a quarter.
  4. Inspect outliers: Treat outliers like you would treat an outage incident.
  5. Compare reviews that met the goal to ones that didn’t: Gather statistics about PRs and see if you can find differences between the two groups. For example: number of lines changed, the author, the reviewer, the number of commits, the part of the codebase, etc.
  6. Put real-time monitoring in place: If you are the lead, just do this manually to start. At the beginning of the day, make sure all of yesterday’s PRs are going to be reviewed soon.
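If you want to roll your own baseline instead of (or alongside) the linked script, here’s a minimal sketch using the GitHub CLI (gh). It assumes gh is installed and authenticated, and it covers steps 1, 4, and 5: a median, plus outliers with one statistic (size) for comparison.

```python
# Sketch: baseline PR metrics via the GitHub CLI ("gh").
# Assumes gh is installed and authenticated for the current repository.
import json
import subprocess
from datetime import datetime
from statistics import median

def merged_prs(limit: int = 200) -> list[dict]:
    out = subprocess.run(
        ["gh", "pr", "list", "--state", "merged", "--limit", str(limit),
         "--json", "number,createdAt,mergedAt,additions,deletions"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def hours_to_merge(pr: dict) -> float:
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(pr["createdAt"], fmt)
    merged = datetime.strptime(pr["mergedAt"], fmt)
    return (merged - created).total_seconds() / 3600

prs = merged_prs()
print(f"Median hours to merge: {median(hours_to_merge(pr) for pr in prs):.1f}")

# Outliers to inspect like incidents (step 4), with size for comparison (step 5)
for pr in prs:
    if hours_to_merge(pr) > 24:
        print(f"PR #{pr['number']}: {hours_to_merge(pr):.0f}h, "
              f"{pr['additions'] + pr['deletions']} lines changed")
```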

Tomorrow, I’ll write about some common problems and what to do about them.