When 99.8% Success Wasn’t Very Good

One of my projects at Trello was looking into and fixing a reliability problem we were having. Even though it seemed solid in testing, we were getting enough support tickets to know that it must be worse than we thought. We collected data on its success rate and found out that it was successful 99.8% of the time. To move forward with doing more work on it, I had to convince our team and management that 99.8% was bad.

At the time, there was a company-wide push at Atlassian to improve reliability, and there was a line manager assigned to oversee it across Trello teams, so I spoke to him. I showed him the data, but also told him that there was an overall feeling on our team that the problem was affecting our customers more than the numbers suggested.

He suggested that I flip the ratio and instead look at the data as Errors Per Million (EPM). When you do that, with the same data, you get 2,000 errors per million attempts. This particular thing happened around 2 million times per day, so that was 4,000 errors per day. That partially explained the issue, because at that volume it wouldn’t take much for the support tickets to get out of control. Luckily, not all 4,000 were of the same severity, and many were being retried. Still, that number was too high.
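To make the arithmetic concrete, here is a minimal sketch of the conversion in Python. The function names are illustrative and not from any Trello tooling; the only inputs are the figures quoted above (99.8% success, roughly 2 million attempts per day). Exact fractions are used so that the result is not blurred by floating-point rounding.

```python
from fractions import Fraction


def errors_per_million(success_percent: str) -> Fraction:
    """Convert a success rate given in percent (e.g. "99.8") to errors per million attempts."""
    failure_fraction = 1 - Fraction(success_percent) / 100
    return failure_fraction * 1_000_000


def errors_per_day(epm: Fraction, attempts_per_day: int) -> Fraction:
    """Scale an errors-per-million figure by the daily attempt volume."""
    return epm * attempts_per_day / 1_000_000


epm = errors_per_million("99.8")
print(epm)                             # 2000
print(errors_per_day(epm, 2_000_000))  # 4000
```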

The other thing I found after more analysis was that the errors were not evenly spread across the user base. They tended to cluster among a smaller cohort, which was experiencing a much worse EPM than the rest of the users.

With that in hand, we greenlit a project to address this by targeting the most severe problems. We found several bugs. Eventually we got the success rate to 99.95%.

It’s not obvious that 99.95% is four times better than 99.8%, but the equivalent EPM after the fixes is 500 (as opposed to 2,000 before). What surprised me is how different a reaction a high EPM got compared to equivalent numbers. 2,000 EPM is literally the same as a 99.8% success rate or a 0.2% failure rate, but the latter two seem fine. Even if I say we get 2 million attempts per day, it’s hard to intuitively understand what that means.
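The factor of four only shows up when you compare failure rates rather than success rates. A short sketch of that comparison, using only the two success rates mentioned above (the helper name is made up for illustration):

```python
from fractions import Fraction


def failure_rate(success_percent: str) -> Fraction:
    """Return the failure rate as an exact fraction for a success rate in percent."""
    return 1 - Fraction(success_percent) / 100


before = failure_rate("99.8")   # 0.2% of attempts fail
after = failure_rate("99.95")   # 0.05% of attempts fail

print(before / after)     # 4, i.e. four times fewer failures
print(after * 1_000_000)  # 500 errors per million after the fixes
```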

When I said 2,000 EPM, we instinctively felt that we had failed 2,000 users. When I said we get 2 million attempts, everyone knew to double the EPM to get 4,000 incidents per day, and we felt even worse. That simple change in reporting made all the difference in our perspective.