Category Archives: Software Development

Observations on the MIT Study on GitHub Copilot

I just saw this study on GitHub Copilot from February. Here is the abstract:

Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogenous effects show promise for AI pair programmers to help people transition into software development careers.

The researchers report benefits to less experienced developers, which is at odds with this other study I wrote about and with my own intuition. However, all of the participants were experienced JavaScript developers, not people literally learning to program, which is where I think the more detrimental effect would show up.

Using Zeno’s Paradox For Progress Bars

When showing progress, if you have a list of a known length and processing each item takes about the same time, you can implement it with pseudocode like this:

for (int i = 0; i < list.length; ++i) {
    process(list[i]);
    // notifyProgress takes a numerator and denominator to
    // calculate percent of progress
    notifyProgress(i + 1, list.length);  // i + 1 items are done, so this reaches 100% on the last item
}

One common problem is not knowing the length beforehand.

A simple solution would be to pick a value for length and then make sure not to go over it.

int lengthGuess = 100;
for (int i=0; list.hasMoreItems(); ++i) {
    process(list.nextItem());
    notifyProgress(min(i, lengthGuess), lengthGuess);
}
notifyProgress(lengthGuess, lengthGuess);

This works OK if the length is near 100, but if it’s much smaller, the bar has to make a big jump at the end, and if it’s much bigger, it gets to 100% way too soon.

To fix this, we might adjust lengthGuess as we learn more:

int lengthGuess = 100;
for (int i=0; list.hasMoreItems(); ++i) {
    process(list.nextItem());
    if (i > 0.8 * lengthGuess) {
        lengthGuess = 2*i;      
    }
    notifyProgress(i, lengthGuess);
}
notifyProgress(lengthGuess, lengthGuess);

In this version, whenever i gets to 80% of lengthGuess, we set lengthGuess to 2*i.  This makes the progress bounce back and forth between 50% and 80% until the list runs out, and then it jumps to the end.  This won’t work.

What I want is:

  1. The progress bar should be monotonically increasing
  2. It should get to 100% at the end and not before
  3. It should look as smooth as possible, but can jump

An acceptable effect would be to progress quickly to 50%, then slow down until 75% (halfway from 50% to 100%), then slow down again at 87.5% (halfway from 75% to 100%), and so on.  If we keep doing that, we’ll never get to 100% inside the loop and can jump to it at the end. This is like Zeno’s Dichotomy paradox (as described on Wikipedia):

Suppose Homer wants to catch a stationary bus. Before he can get there, he must get halfway there. Before he can get halfway there, he must get a quarter of the way there. Before traveling a quarter, he must travel one-eighth; before an eighth, one-sixteenth; and so on.

To do that, we keep a factor that scales how much each item advances the progress and shrink it each time we pass a milestone (playing around with it, I found that using a factor of 1/3 rather than 1/2 was more pleasing).

int lengthGuess = 100;
double begin = 0;               // progress value where the current segment started
double end = lengthGuess;
double iFactor = 1.0;           // how much each processed item advances the progress
double factorAdjust = 1.0/3.0;  // shrink iFactor by this much at each milestone
for (int i = 0; list.hasMoreItems(); ++i) {
    process(list.nextItem());
    double progress = begin + (i - begin) * iFactor;
    // once we're a third of the way from begin to end, start a new, slower segment
    if (progress > begin + (end - begin) * factorAdjust) {
        begin = progress;
        iFactor *= factorAdjust;
    }
    notifyProgress(progress, lengthGuess);
}
notifyProgress(lengthGuess, lengthGuess);

The choice of lengthGuess is important; I think erring on the small side is your best bet.  You don’t want it to be exact, because we start slowing down once we get a third of the way toward the goal (factorAdjust).  Both lengthGuess and factorAdjust could be passed in and determined from whatever information you have about the length of the list.

How to fix WCErrorCodePayloadUnsupportedTypes Error when using sendMessage

If you are sending data from the iPhone to the Apple Watch, you might use sendMessage.

func sendMessage(_ message: [String : Any], replyHandler: (([String : Any]) -> Void)?, errorHandler: ((Error) -> Void)? = nil)

If you do this and get the error WCErrorCodePayloadUnsupportedTypes, it’s because you put an unsupported type in the message dictionary.

The first parameter (message) is a dictionary of String to Any, but the values cannot really be of any type. If you read the documentation, it says that message is

A dictionary of property list values that you want to send. You define the contents of the dictionary that your counterpart supports. This parameter must not be nil.

“Property list values” means values that can be stored in a plist. You can use simple types like Int, Double, Bool, String, Date, and Data, and you can also use arrays and dictionaries as long as they contain only those simple types (e.g. an Array of Ints).

I ran into this issue because I tried to use a custom struct in the message dictionary, which is not supported.
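To fix it, you can encode the struct to Data (a property list type) and send that instead. Here’s a minimal sketch; the Workout struct and the "workout" key are made up for this example:

import Foundation
import WatchConnectivity

struct Workout: Codable {
    let name: String
    let minutes: Int
}

let workout = Workout(name: "Morning run", minutes: 30)

// Putting the struct in the dictionary directly fails with
// WCErrorCodePayloadUnsupportedTypes:
//   WCSession.default.sendMessage(["workout": workout], replyHandler: nil)

// Encoding it to Data first works, because Data is a property list type:
if let data = try? JSONEncoder().encode(workout) {
    WCSession.default.sendMessage(["workout": data], replyHandler: nil) { error in
        print("send failed: \(error)")
    }
}

On the watch side, you decode the Data back into the struct with JSONDecoder.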

Note: I made this post because Google is sending people to Programming Tutorials Need to Pick a Type of Learner because it mentions WCErrorCodePayloadUnsupportedTypes incidentally, but that post isn’t really about it.

Pre-define Your Response to the Dashboard

A few days ago, I wrote about using Errors Per Million (EPM) instead of success rate to get better intuition on reliability. I also recently said that Visualizations Should Generate Actions. Sometimes it’s obvious what to do, but if not, you can think through the scenarios and pre-define what actions you would take.

Here’s an example. This is a mock-up of what a dashboard showing EPM over time might look like. The blue line is the EPM value on each date:

The three horizontal lines set levels of acceptability. Between Green and Yellow is excellent, between Yellow and Red is acceptable, and above Red is unacceptable. When we did this, we thought about using numbered severity levels (like in the Atlassian incident response playbook), but we decided to use Green/Yellow/Red for simplicity and intuition.
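As one way to turn those bands into levels, here’s a quick sketch. The threshold values are made up for illustration (not our real lines), and I’m treating everything below the Yellow line as the Green level:

enum Level {
    case green, yellow, red
}

// Hypothetical positions of the Yellow and Red lines, in EPM
let yellowLine = 1_000.0
let redLine = 5_000.0

// Below the Yellow line is the Green level, between Yellow and Red is the
// Yellow level, and above Red is the Red level.
func level(forEPM epm: Double) -> Level {
    if epm > redLine { return .red }
    if epm > yellowLine { return .yellow }
    return .green
}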

We also pre-defined the response you should have at each level. It was something like this:

Green: None

Yellow: There must be at least one item in the current sprint with high priority to address this until the level is back to Green. It can be deployed when the current sprint is deployed.

Red: At least one person must be actively working to resolve the issue and doing hot fix deploys until the level is back to Yellow.

The advantage of this was that these actions were all pre-negotiated with management and product managers. This meant that once we hit a given level, we could just go ahead and fix things instead of letting items get lost in the backlog.

When we created this dashboard, we were in the Red, but we knew that going in. We worked to get ourselves to Green, and in practice we were rarely not Green. This is another reason to pre-define your response: it’s too hard to remember how to handle situations that rarely happen.

When 99.8% Success Wasn’t Very Good

One of my projects at Trello was looking into and fixing a reliability problem we were having. Even though it seemed solid in testing, we were getting enough support tickets to know that it must be worse than we thought. We collected data on its success rate and found out that it was successful 99.8% of the time. To move forward with doing more work on it, I had to convince our team and management that 99.8% was bad.

At the time, there was a company-wide push at Atlassian to improve reliability, and there was a line manager assigned to oversee it across Trello teams, so I spoke to him. I showed him the data, but also told him that there was an overall feeling on our team that the problem was affecting our customers more than the numbers suggested.

He suggested that I flip the ratio and instead look at the data as Errors Per Million. When you do that with the same data, you get 2,000 errors per million attempts. This particular operation happened around 2 million times per day, so that was 4,000 errors per day. That partially explained the issue, because it wouldn’t take much for the support tickets to get out of control. Luckily, not all 4,000 were of the same severity, and many were being retried. Still, that number was too high.
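The conversion is simple arithmetic; here it is as a quick sketch using the numbers above:

let successRate = 0.998
let attemptsPerDay = 2_000_000.0

let failureRate = 1.0 - successRate              // 0.2%
let epm = failureRate * 1_000_000                // ≈ 2,000 errors per million
let errorsPerDay = failureRate * attemptsPerDay  // ≈ 4,000 errors per day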

The other thing I found after more analysis was that the errors were not evenly spread across the user base. They tended to cluster in a smaller cohort, which was experiencing a much worse EPM than the rest of the users.

With that in hand, we greenlit a project to address this by targeting the most severe problems. We found several bugs. Eventually we got the success rate to 99.95%.

Looking at success rates, it’s not obvious that 99.95% is four times better than 99.8%, but the equivalent EPM after the fixes is 500 (as opposed to 2,000 before). What surprised me is how different a reaction a high EPM got compared to the equivalent percentages. 2,000 EPM is literally the same as a 99.8% success rate or a 0.2% failure rate, but the latter two seem fine. Even if I say we get 2 million attempts per day, it’s hard to intuitively understand what that means.

When I said 2,000 EPM, we instinctively felt that we had failed 2,000 users. When I added that we get 2 million attempts per day, everyone could double the EPM to get 4,000 incidents per day, and we felt even worse. That simple change in reporting made all the difference in our perspective.

How I Use JIRA and Trello Together

I started using JIRA for issue tracking when I worked at Trello (at Atlassian), and I still use it now. JIRA does everything I need in managing software projects, but I never send people outside of my team to JIRA because it’s not easy for casual users. For that I use Trello.

I have a Trello board for each project I am managing that is meant to be a high-level summary of that project. It is useful for onboarding and getting its current status easily. It has links to JIRA, Confluence (for specifications), Atlas (for status) and Figma.

This Trello board is the first place I send a new team member to help with onboarding. If someone has a question in Slack about the project, I make sure the answer is something you could find from the board, and then link them to it there. The board is a kind of dashboard and central hub for the project.

These hub boards are curated, so I don’t try to use any automations to bring things over. If I think you need more information, I send you directly to the source.

JIRA is useful to the people that work on the project every day. I use Trello for those that just check in weekly or monthly.

Visualizations Should Generate Actions

Yesterday, I shared a heatmap visualization that I used to target manual testing time. I chose a heatmap to show this data because you can tell what you need to do just by looking at it.

In this example (a heat map showing the test status of iOS devices across different features in an app):

It’s pretty clear that you should get an iPhone 13 with iOS 15 on it and start testing everything. You could also explore board creation on all devices. If the entire heatmap were green, you would know that you had probably covered most areas.

It would be easy to write a program that took this same data and generated a to-do list instead. Maybe that would be preferable, but people like visual dashboards, and it’s easier to see the why behind the task if you have a sense of the underlying data.
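For example, here’s a sketch of that idea; the data structure, names, and the “under-tested” threshold are all hypothetical:

// Hypothetical coverage data: feature -> device -> test coverage score,
// where anything below 1.0 counts as under-tested (a red cell).
let coverage: [String: [String: Double]] = [
    "Board creation": ["iPhone 13 / iOS 15": 0.1, "iPhone 12 / iOS 15": 0.4],
    "Card editing":   ["iPhone 13 / iOS 15": 0.2, "iPhone 12 / iOS 15": 1.3],
]

// Turn every red cell into a to-do item.
var todoList: [String] = []
for (feature, devices) in coverage {
    for (device, score) in devices where score < 1.0 {
        todoList.append("Test \(feature) on \(device)")
    }
}

todoList.forEach { print($0) }   // e.g. "Test Board creation on iPhone 13 / iOS 15"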

That’s also a big clue as to whether your dashboard visualization works. If you could easily generate a to-do list just by looking at it, then it probably works. If you look at your dashboard and have no response, it might look pretty, but it’s not doing its job.

Use Heatmaps for iOS Beta Test Coverage

At Trello, I built a simple visualization for understanding coverage of our app during Beta periods. We used Mode to analyze data, and so I used their Heatmap.

Here’s a recreation in Google Sheets:

Along the top was each device family and OS. Individual devices were grouped based on how likely they were to be similar in testing (based on size, version, etc). I used this list of Apple device codes (which were logged with analytic data).

Along the left side were the most important screens and features. It was a much longer list that was generated from analytic categories.

The center of the visualization was a heat map based on how much usage that feature got on that device (at that intersection) normalized against how much usage it got in production. So, if a cell was green, it meant that it was tested a lot compared to how much it was used in production. If a cell was red, it meant it was under-tested.
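Here’s a minimal sketch of that normalization with a made-up data shape and names (the real calculation lived in Mode, not in app code):

struct CellUsage {
    let betaEvents: Double   // events for this feature on this device during the beta
    let betaTotal: Double    // all events during the beta
    let prodEvents: Double   // events for this feature on this device in production
    let prodTotal: Double    // all events in production
}

// Ratio of the cell's share of beta usage to its share of production usage.
// Near 1.0 or above reads as green (tested at least as much as it's used);
// well below 1.0 reads as red (under-tested).
func coverageRatio(_ cell: CellUsage) -> Double {
    let betaShare = cell.betaEvents / cell.betaTotal
    let prodShare = cell.prodEvents / cell.prodTotal
    return betaShare / prodShare
}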

Often, entire vertical columns would be near red because the combination of device/OS wasn’t used much by our beta testers. So, we could direct our own efforts towards those devices and turn an entire column from red to green.

We also made sure new features would get their own row. These could also be red because beta testers might not know about them. We could similarly target those areas on all devices. These features could not be normalized against production usage (since they were not in production yet), so we used a baseline usage as a default.

Mode kept snapshots of the heatmaps over time. We could watch it go from nearly all red at the beginning of the beta period to more green by the end. I can’t say we could get the entire heatmap to be green, but we could at least make sure we were testing efficiently.

PR Authors Have a Lot of Control Over PR Idle Time

Getting a pull request reviewed quickly is often under the author’s control. This is great news because, according to DevEx, you should reduce feedback loops to increase developer productivity. There are other feedback loops that a developer experiences, but pull requests happen all of the time. You can have a big impact on productivity if they happen faster, and a big reason they don’t is how the commits in the PR are constructed.

At Trello, during a hackathon, someone did an analysis on all of the PRs on all of the teams to see if they could get some insights. At the time, we probably had about 10 teams of about 7-10 developers each.

One thing they looked at was the median time to approve a PR by team, and there were two teams that were far outliers (with much shorter waits for a PR to be approved). They went further and looked at the PRs themselves and noticed that they generally had fewer commits and the commits themselves were smaller. The number of pull requests per developer-week was also much higher than on the other teams.

I was on one of those teams, and the other one was very closely aligned with us (meaning we had a lot of shared processes and rituals). When we were very small, with 5 developers in total, we were basically one team with a shared lead. The style of PR on these teams was very intentional. When I was onboarded, I was given very specific instructions on how to make a PR.

The essence of what we did was to completely rewrite the commits before making a pull-request to “tell a story”. I wrote about the details in Construct PRs to Make Reviewing Easy.

As a reviewer, this made approvals very easy to do, and we could fit them in at almost any time. With all of us doing this, many PRs were approved within an hour and most within a few hours. A really good time to do some reviewing was right after submitting a PR, which helped the throughput reach a steady state. The PR list was rarely very long.

I worked in this style for the 6+ years I was on this team and know that it contributed to a high level of personal work satisfaction. Even though I had to wait for others to approve my work, I felt that my own productivity was largely under my control.

Related: Metrics that Resist Gaming

Just Started a New Software Engineering Job? Fix Onboarding

If you are about to start a new job as a software engineer, the way to have a big impact on day one is to go through onboarding with the intention of generating a list of improvements that you can work on over time.

Here are some things to look for:

  1. Incorrect or outdated information. If you find any, just fix it on the spot.
  2. Missing entry-point documentation. Even teams that have good documentation often do not have a document that is useful when you know nothing; no one goes back and writes a good “start here” kind of document.
  3. Manual steps that could be scripted. Don’t go overboard, but if you see some quick wins to automate the dev setup steps, it’s a good first PR. It’s a tech debt payoff that is timed perfectly.
  4. Dev setup automation bug fixes. If anything goes wrong while running the scripts that set up your machine, fix the bug, or add detection (or better error messages) that would have helped diagnose the issue.

There is usually a lot of tech debt in onboarding code and documents because the existing team rarely goes through them. Sometimes underlying things just change and the tech debt happens to you. You are in a unique position to make it better for the next person and to have some impact right away.