Category Archives: Software Development

If code reviews take too long, do this first

Short feedback loops are one of the drivers of productivity according to the DevEx model. On my team at Trello, we had a goal of all reviews being done inside 24 hours. Having that goal drove behaviors that made most reviews complete in a few hours. So, to start, collect data and get on the same page.

If your reviews are taking too long, try these enabling steps first:

  1. Gather metrics: If you use GitHub, try this repository metrics script to get a baseline (there’s a rough sketch of the idea just after this list).
  2. Get consensus: Nothing will happen unless the whole team is on board with this being a problem and that it can be fixed.
  3. Set a goal: I know from experience that 100% of reviews in less than 24 (work) hours is possible. If that seems out of reach, set something that you could accomplish in a quarter.
  4. Inspect outliers: Treat outliers like you would treat an outage incident.
  5. Compare reviews that met the goal to ones that didn’t: Gather statistics about PRs and see if you can find differences between the two groups. For example: number of lines changed, the author, the reviewer, the number of commits, the part of the codebase, etc.
  6. Put real-time monitoring in place: If you are the lead, just do this manually to start. At the beginning of the day, make sure all of yesterday’s PRs are going to be reviewed soon.
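
If you don’t have a metrics script handy, here is a rough sketch of the kind of baseline measurement step 1 is after, using the GitHub API through Octokit. The owner/repo values, the token variable, and the “time to first review” definition are my own placeholders for illustration, not the script linked above.

// Sketch: time-to-first-review for recently closed PRs.
// Assumes @octokit/rest is installed and GITHUB_TOKEN is set.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "your-org"; // placeholder
const repo = "your-repo"; // placeholder

async function reviewTurnaround(): Promise<void> {
  const { data: prs } = await octokit.rest.pulls.list({
    owner,
    repo,
    state: "closed",
    per_page: 50,
  });

  for (const pr of prs) {
    const { data: reviews } = await octokit.rest.pulls.listReviews({
      owner,
      repo,
      pull_number: pr.number,
    });
    if (reviews.length === 0 || !reviews[0].submitted_at) continue;

    const opened = new Date(pr.created_at).getTime();
    const firstReview = new Date(reviews[0].submitted_at).getTime();
    const hours = (firstReview - opened) / (1000 * 60 * 60);
    console.log(`#${pr.number}: ${hours.toFixed(1)} hours to first review`);
  }
}

reviewTurnaround().catch(console.error);

Even a crude number like this is enough to get the team on the same page about whether there is a problem.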

Tomorrow, I’ll write about some common problems and what to do about them.

Four Ways to Augment Code Coverage

Code coverage by itself is a hard metric to use because it can be gamed, which makes it especially susceptible to Goodhart’s Law, summarized as “When a measure becomes a target, it ceases to be a good measure.” Goodhart’s Law observes that if you put pressure on people to hit a target, they will hit it, but maybe not in the way you wanted.

This happens with code coverage because we can always increase coverage with useless tests, tests of trivial functions, or tests of less valuable code.

I use these metrics in combination with coverage to make it harder to game:

  • Code Complexity: The simplest way to do this is to count the branches in a function. I use extensions in my code editor to help bring complex code to my attention. If coverage of the function is also low, I know that I can make the code less risky to change if I test it (or refactor it).
  • Usage analytics: If you tag your user analytics events with the folder of the code that generates them, you can later build reports that tie back to your coverage reports. See Use Heatmaps for iOS Beta Test Coverage. In that post, I used it to direct manual testing, but it would work for code coverage as well.
  • Recency of the code: To make sure that my PRs have high coverage, I use diff_cover. This makes it more likely that my tests are finding bugs in code that is going to be QA’d soon and has already been deemed valuable to write. Very old code is more likely to be working fine, so adding tests to it might not be worth it. If you find a bug in old code worth fixing, it will generate a PR (and become recent code).
  • Mutations: I am still trying to find a good tool for this, but this lets you test the quality of your assertions in addition to your coverage. I do it manually now.

Generally, the way to make a metric harder to game is to combine it with another metric that would get worse if the first one were gamed in ways you can predict (or have seen).
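
To make that concrete, here is a small sketch of the complexity-plus-coverage combination from the list above. The data shapes and thresholds are made up for illustration; the real numbers would come from whatever coverage and complexity tools you already use.

// Sketch: flag functions that are both complex and undertested.
// The inputs are hypothetical; feed in data from your own tools.
interface FunctionMetrics {
  name: string;
  branchCount: number;  // rough stand-in for complexity
  lineCoverage: number; // 0..1, from your coverage report
}

function riskyFunctions(
  metrics: FunctionMetrics[],
  maxBranches = 10,
  minCoverage = 0.6
): FunctionMetrics[] {
  return metrics
    .filter((m) => m.branchCount > maxBranches && m.lineCoverage < minCoverage)
    .sort((a, b) => b.branchCount - a.branchCount);
}

// Example with made-up numbers: only parseInvoice is flagged.
console.log(
  riskyFunctions([
    { name: "parseInvoice", branchCount: 18, lineCoverage: 0.35 },
    { name: "formatDate", branchCount: 2, lineCoverage: 0.1 },
    { name: "syncAccounts", branchCount: 14, lineCoverage: 0.9 },
  ])
);

Adding 100% coverage to trivial functions would not move this report at all, which is the point.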

[Screenshot of the Invaders game]

Play Invaders on Glitch

My nephew and I are meeting once a week to make video games. We are using Phaser (and JavaScript) as our game engine and Glitch as our coding IDE.

Here’s one of our games: Invaders. Since it’s on Glitch, you can see all of the code and “remix” it into another game. In the constructor of the Invaders class there are a lot of member variables that you can change to tweak the game to your taste.

I’m sharing this because I think that to learn programming, you should Start with a Working System, not use tutorials. This is closer to what real programming jobs are like anyway. Once you are comfortable with what the code does, then build a new game from scratch by copying over pieces a little at a time as you need them. That’s what I did to learn Phaser. The game I used is out of date, but I’ll share my fork and update it soon.

In this game, we use:

  • Sprites
  • Animations
  • Sounds
  • Keyboard controls
  • Collision detection
  • The physics engine: so that we can use velocity instead of updating the positions manually

You can make a lot of games with just those basic tools.
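
For anyone curious what wiring those pieces together looks like, here is a stripped-down Phaser 3 sketch with a sprite, keyboard controls, velocity, and an overlap check. The asset keys, sizes, and numbers are placeholders, not the actual Invaders code.

// Minimal Phaser 3 scene: a player sprite steered by the keyboard,
// plus collision detection against a row of enemies. Assets are placeholders.
import Phaser from "phaser";

class MiniScene extends Phaser.Scene {
  private player!: Phaser.Physics.Arcade.Sprite;
  private cursors!: Phaser.Types.Input.Keyboard.CursorKeys;
  private enemies!: Phaser.Physics.Arcade.Group;

  preload() {
    this.load.image("player", "player.png");
    this.load.image("enemy", "enemy.png");
  }

  create() {
    this.player = this.physics.add.sprite(400, 550, "player");
    this.enemies = this.physics.add.group({
      key: "enemy",
      repeat: 5,
      setXY: { x: 100, y: 100, stepX: 100 },
    });
    this.cursors = this.input.keyboard!.createCursorKeys();

    // Collision detection: remove an enemy when the player touches it.
    this.physics.add.overlap(this.player, this.enemies, (_player: any, enemy: any) => {
      enemy.destroy();
    });
  }

  update() {
    // The physics engine lets us set velocity instead of updating x manually.
    if (this.cursors.left.isDown) this.player.setVelocityX(-200);
    else if (this.cursors.right.isDown) this.player.setVelocityX(200);
    else this.player.setVelocityX(0);
  }
}

new Phaser.Game({
  type: Phaser.AUTO,
  width: 800,
  height: 600,
  physics: { default: "arcade" },
  scene: MiniScene,
});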

Apples and Oranges are (Relatively) Easy to Compare

I don’t like stack ranking because it’s hard to compare people and put them into a rank ordering.

One of the most surprising things I learned in math was that complex numbers have no natural ordering. Meaning, less-than and greater-than are not defined for complex numbers. It makes sense when you think about it for a minute. The same applies to other multi-dimensional things like matrices and vectors.

So, why do we think we can rank order people? I’m specifically talking about companies that do this for their employees, but it comes up in other contexts (e.g. class rank).

People are hard to compare, but when we say that it’s like comparing apples and oranges, I disagree. Apples and oranges are both fruit, they both have around the same number of calories, they are about the same size, shape, and cost. They are easy to turn into snackable pieces (slices or segments). Even for me personally, I like them both about the same. On a lot of dimensions, they are about the same. When that’s true, the comparison might turn into just a single dimension where they vary more—maybe only in specific situations.

For me, the main way they are different is in how they travel. Eating an orange is more of a mess and harder to deal with outside of my kitchen. I’m much more likely to grab an apple to throw in my hiking bag or take to the beach. Another difference is in recipes: I know a lot more apple desserts than orange ones, and apples stand up to baking better.

But, how about apples and raisins, or apples and candy, or apples and tempeh, or apples and bicycles? Those are harder to compare because they vary on more dimensions. In the bicycle case, they don’t even share dimensions except generic ones—they are both objects that have size and weight.

Getting back to stack ranking (which I still don’t like): inside of a team, it makes no sense to me. You would have a mix of levels and experience. That mix makes the team valuable, and arbitrarily favoring one dimension hurts the mix.

Like comparing apples and oranges (which is easy), it would work better if you could remove dimensions and only compare one or two. So, for example: compare just your backend senior developers with each other on just system design skill. You could reduce the set to just those with two years at this level. This might be useful when considering a promotion. In this situation, you might value mentoring and consensus building skills more than in-depth knowledge of TypeScript. So, it’s situational (like which fruit to use for a pie) and has reduced dimensions. Another advantage is that you don’t need a full ordering to complete the task.

Leet Code at Work

I prefer work simulation questions to leet code questions for tech interviews. Asking interviewees to write code that is similar to what we actually did is better than, for example, having them find a successor node in a BST. At Trello, our tech interview would have you refactoring code in a way that is very common in iOS or implementing a UI from a spec. At Atalasoft, we had a lot of image processing algorithms in our code base, so I wanted to see you do something simple with pixels.

The other day I was thinking about my career and trying to remember if I ever did have to code a custom algorithm given a specification, and I did come up with a few examples. I’ve written before that my career happened to have a lot of math in it, and those same jobs sometimes also needed me to implement algorithms.

But more often, I chose algorithms (or knew that I needed to). I think that’s a more universally useful skill. It’s often the case that something just isn’t fast enough. A lot of published and common implementations of algorithms work well for the general case, but you may be able to make some assumptions in your specific application that allow you to do something better. Or your particular workloads might make a different set of trade-offs more appropriate.

To do this, it’s good to have broad knowledge of a bunch of choices, just so you know what techniques might be possible. These days, I think AI can help you a lot with this, but it helps to know what to ask for and when to ask for it.

Applying my first two programming lessons to teaching kids to program

I have been meeting with my 10-year-old nephew about once a week to make video games together. It’s mostly just to have fun and for me to also chat with his dad, who helps on his side—we’re across the country from each other. His dad and I went to middle school together and got kicked out of wood shop and into a computer “shop” where we learned to program on the Commodore PET. We were 13 at the time and learned enough BASIC to make simple games. So, in a way, we’re carrying on that tradition.

My first program back then was to draw on the screen given a list of coordinates. It was about 5 lines of real code (and a dozen lines of DATA). It was easy to understand and it put stuff on the screen right away, which was my first lesson in programming.

It’s hard to do that in modern programming, especially if you want to be able to keep going on to build something more real. There’s so much boilerplate and ceremony just to get something on the screen.

After some trial and error, I have settled on using Phaser as our game engine and Glitch as our editor.

Glitch lets you edit code like a Google Doc. We can both be in the editor and change code simultaneously. Ironically, it took me time to remember Glitch even though I got to witness its genesis inside of Fog Creek when I was working at Trello.

Glitch is easiest to use with JavaScript, so I had to pick a JS game engine. Since someone already had made a remixable Phaser game, I started with that. My second lesson in programming is that it’s easier to learn how to modify a working thing than to make something from scratch. This is really an extension of the first lesson—when you are making something bigger, it takes too long to get something working, so to learn how to make big things, start with the big thing already done. First you modify parts to get a sense of all the pieces, and then later, you will be able to make one from scratch.

I’ll post some of our games next week. They are mostly simple ports of classic 2D games from the 80’s and 90’s.

Mutation Testing

A couple of years ago, I wrote about a testing technique that I had learned, but didn’t remember the name of, so I called it code perturbance. I mentioned this technique in my book, and a helpful beta reader let me know the real name: mutation testing.

The idea is to intentionally change the code in a way that introduces a bug, but is still syntactically correct (so you can run it). Then, you run your test suite to see if it can find the problem. This augments code coverage, which will only let you know that code was run in a test, not if an assertion was made against it.
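
As a tiny, hand-rolled example of what a mutation looks like (the function and tests here are made up for illustration):

// Original function.
function isAdult(age: number): boolean {
  return age >= 18;
}

// A mutation: flip >= to >. It still compiles, but the behavior changed.
function isAdultMutated(age: number): boolean {
  return age > 18;
}

// A weak test: full line coverage, but the mutant survives.
console.assert(isAdult(30) === true);
console.assert(isAdultMutated(30) === true); // still passes, mutation not caught

// A better test probes the boundary and kills the mutant.
console.assert(isAdult(18) === true);
console.assert(isAdultMutated(18) === true); // logs an assertion failure: mutant caught

If your suite can’t tell the original from the mutant, the coverage number was giving you false confidence.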

Now that I know the name, I can find out more about it on Google. For example, there are tools that can do it for you automatically. The one that I’m most interested in is Stryker-mutator, because it supports TypeScript. I’ll report back when I try it.

Add Date Tests for Daylight Savings

I wrote my iOS App, Habits, after I had been at a job with very strict daylight savings coding rules, so it has lots of unit tests that take the code across the daylight savings/standard time boundaries.

But now it’s been almost 20 years (!) since I worked there, and I’ve picked up bad habits, which bit me last week. Starting Nov 3rd at 2am, and for the next week, I had a bug in my code for calculating the date that was “one week ago” because I used JavaScript’s setDate() instead of setUTCDate().

Luckily, my app is still being written, not in production, so it was only a problem in my unit tests. The tests use the current date, so they only find daylight savings problems when the current date happens to fall near a boundary.

The right way is to also have tests that specifically set the date to known daylight savings boundaries and the dates around them.
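
Here is roughly what I mean, as a sketch. The oneWeekAgo helper is a stand-in for my actual code, and the dates are pinned around the Nov 3, 2024 fall-back boundary that bit me.

// Hypothetical helper: compute "one week ago" with UTC date math,
// which is not affected by local daylight savings shifts.
function oneWeekAgo(from: Date): Date {
  const d = new Date(from.getTime());
  d.setUTCDate(d.getUTCDate() - 7);
  return d;
}

// Pin the test to a known boundary instead of relying on the current date.
// US clocks fell back on Nov 3, 2024 at 2am local time, so a week ending
// Nov 6 crosses the boundary.
const afterFallBack = new Date(Date.UTC(2024, 10, 6, 12, 0, 0)); // Nov 6, 2024
const weekAgo = oneWeekAgo(afterFallBack);

console.assert(weekAgo.getUTCFullYear() === 2024);
console.assert(weekAgo.getUTCMonth() === 9); // October (months are zero-based)
console.assert(weekAgo.getUTCDate() === 30); // exactly seven days earlier

A similar test should exist for the spring-forward boundary and for dates far away from either one.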

Technical Debt Typology Research Paper

A few months ago, I got an email from Mark Greville that included a link to a research paper he coauthored, called A Triple Bottom-line Typology of Technical Debt: Supporting Decision-Making in Cross-Functional Teams.

In the paper, the authors identify several categories of tech debt. One category is internal vs. external effects. In my book, I also identified the external category, which I call visibility. The paper treats the entire business as “internal,” but I think of the team itself as the internal part. My separation is driven by the difference in how the engineering team communicates within itself vs. with the rest of the business. Customers and other public stakeholders would likely be similar to the non-engineering business teams.

Since a lot of my book is about how tech debt affects developer productivity, I break down internal to the various ways it could reduce productivity. I use misalignment to describe tech debt that doesn’t meet the documented standards of the team. When the code is hard to change, for example if there’s messy code all over the codebase (Marbleized Code Fat), I call that resistance. If the code does what it’s supposed to (so no external effects) and the customers highly depend on its behavior, I warn about the risk of regressions.

Another pair they describe is whether the tech debt is taken knowingly or unknowingly. This is useful from a taxonomy perspective and might contribute to tech debt avoidance, but in my book, I write:

I don’t think of tech debt as the result of an intentional shortcut borrowed from the future. Some debt starts that way, but the reality is that lots of tech debt happens because the world changes. Even if your system represents your best ideas of how to solve the problem at hand, your ideas will get better, and the problems will change. You can do everything right and still have bad code, so it doesn’t help to judge the decisions that got us there. Learn from them, but it’s counter-productive to dwell on them.

My chapters on these dimensions focus on using them to decide what to do about the debt, and I don’t think intention is a factor in deciding what to do next.

The paper is worth a look and also has quite a good bibliography if you are interested in research on tech debt. Since the methodology of the research included a literature review, the list of references reviewed is another treasure trove of research.

Adam Tornhill on Tech Debt’s Multiple Dimensions

In the research for my book on technical debt, I ran into this talk by Adam Tornhill.

Adam has a similar perspective to mine: technical debt is multi-faceted, and the right strategy should address its various dimensions.

One of his examples is combining a code complexity metric with data from your source code repository to define low code health hotspots—areas where code is both complex and frequently changed. To find the hotspots, he built a tool to calculate this metric and visualize it. In the video he shows data from big open-source projects (like Android and .NET core) and pinpoints areas that would benefit from work to pay down debt.

Similarly, in my book, I identify eight dimensions of debt. Complexity is something I consider to be part of Resistance, which is how hard or risky it is to change the code. I would also incorporate low test coverage into resistance, as well as subjective criteria. Adam says that complexity is a good estimate of how many tests you need, which is true, and I give you credit for having the tests. I am mostly concerned by complex code that is undertested.

Like Adam, I believe that bad code only matters if you plan to change it. He believes that the repository history of changes is a good indication of future change, which I agree with, but to a lesser degree. In my book, I recommend that you look at the history, and I shared this git log one-liner as a starting point:

git log --pretty=format: --name-only --since=3.months | sort | uniq -c | sort -rg | head -10

That line will show you the most edited files. To find the most edited folders, I use this:

git log --pretty=format: --name-only --since=3.months | sed -e 's/^/\//' -e 's/[^\/]*$//' | sort | uniq -c | sort -rg | head -10

This data contributes to a dimension I call volatility. It’s meant to be forward looking, so I would mostly base it on your near-future roadmap. However, it is probably correlated with the recent past. In my case, this data is misleading because I just reorganized my code to pull out a library shared between the web and mobile versions of my app. But, knowing this, I could modify the time period or perhaps check various time periods to see if there’s some stable pattern.
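
To give a flavor of the hotspot idea, here is a small sketch that joins churn counts (like the output of the one-liners above) with complexity scores and ranks files by their product. The data shapes and the scoring are my own simplification for illustration, not Adam’s tool or the method from my book.

// Sketch: rank files by churn x complexity to surface candidate hotspots.
// Churn could come from the git one-liner above; complexity from your own tooling.
interface FileStats {
  path: string;
  churn: number;      // edits in the last few months
  complexity: number; // e.g., total branch count in the file
}

function hotspots(files: FileStats[], top = 5): FileStats[] {
  return [...files]
    .sort((a, b) => b.churn * b.complexity - a.churn * a.complexity)
    .slice(0, top);
}

// Example with made-up numbers: the frequently edited, complex file wins.
console.log(
  hotspots([
    { path: "src/sync/engine.ts", churn: 42, complexity: 90 },
    { path: "src/ui/button.ts", churn: 60, complexity: 4 },
    { path: "src/legacy/report.ts", churn: 2, complexity: 120 },
  ])
);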

Generally, my opinions about tech debt and prioritization are very aligned with what’s in this video, especially the multi-faceted approach.