Category Archives: AI

Use AI to Code Review Yourself, Not Others

In What to do about excessive (good) PR comments I wrote that a PR with a lot of good, actionable comments might be a sign that the author didn’t do enough beforehand to post an acceptable PR. It depends on many factors. For example, a new hire or junior developer might still need training, and so you would expect PRs to need some work,

But, there are some things that should never make it to review. We have linters and automatic code formatters, so we should never be correcting indentation in a review. If you use Rust, you are never finding a dangling pointer bug. There are automated tools for finding accessibility issues and security issues—which aren’t perfect, but they are probably better (and definitely faster) than most humans. All of these tools are somewhat mechanical, though, built on parser technology. We can do “better” with AI.

AI writes code good enough to be a starting point, and it can also review code. Since I think that the point of code reviews is information sharing, we wouldn’t want to replace them with AI. But AI is the perfect thing to use on yourself to make sure you don’t open PRs with obvious problems that slow down the review.

How to Reduce Legacy code

There are two books that I read that try to define legacy code: Working Effectively with Legacy Code [affiliate link] by Michael Feathers and Software Design X-Rays [affiliate link] by Adam Tornhill. Briefly, Feathers says it’s code without tests and Tornhill adds that it’s code we didn’t write. In both cases, there’s an assumption that the code is also low quality.

To deal with the first problem, you should of course, add tests, and Feathers gives plenty of tips for how to do that. I would add that you should use a tool like diff-cover to make that happen more often. Diff-cover takes your unit test code coverage reports and filters them down so that you can see the code changed in the current branch that isn’t covered. Often, that code is inside of other functions, so covering your code will reduce the legacy code you touched.

This makes sense because once you have changed code, you really can’t say that you didn’t write it. Tornhill’s book describes software he sells that identifies legacy code using the repository history. It highlights code that no current engineer has touched. Your edit would remove that code from the list.

But, the author of the PR isn’t the only person who should be considered to know this code—the code reviewer should be counted too. If they have never seen this part of the codebase before, they should learn it enough to review the code. This is why I don’t like the idea of replacing human reviewers with AI. It has a role in reviews, but you can’t increase the number of developers that know a portion of a codebase by letting AI read it for you.

Every part of the codebase that is edited or reviewed by a new person reduces its “legacyness”. So does every test. If your gut tells you an area of the code is legacy, check the repo history and coverage. If you have a budget for dealing with engineering-led initiatives, assign a dev who hasn’t made edits to that area to write some tests.

LLMs Talk Too Much

The biggest tell that you are chatting with an AI and not a human is that they talk too much.

Yesterday, to start off a session with ChatGPT, I asked a yes/no question and it responded with a “Yes” and another 325 more words. No person would do this.

Another version of this same behavior is when you ask an LLM a vague question—it answers it! No questions. Just an answer. Again, people don’t usually behave this way.

It’s odd because in the online forums that were used to train these models, vague questions are often called out. It’s nice that the LLM isn’t a jerk, but asking a clarifying question is basic “intelligent” behavior and they haven’t mastered it yet.

LLMs Can’t Turn You Into a Writer

In Art & Fear, the authors write:

To all viewers but yourself, what matters is the product, the finished artwork. To you and you alone, what matters is the process, the experience of shaping that artwork. The viewers’ concerns are not your concerns. Their job is whatever it is, to be moved by art, to be entertained by it, whatever.

Your job is to learn to work on your work.

If you use LLMs to write for you, you will end up with writing. You’ll pass your class, get by at work, or get some likes on a post. But you will not become a better writer.

It might even be hard to become a better judge of writing. If you do, it won’t be by reading what LLMs bleat out. It won’t be by reading their summaries of great work. “Your job is to learn to work on your work” and to do that you need to do your own writing and limit your reading to those that do the same.

Humans Have Been Upgraded by LLMs

If, twenty years ago, you took the most sophisticated AI thinkers in the world and set up a Turing test with a modern AI chatbot, they would all think they were chatting with a human. Today, that same chatbot would have a hard time fooling most people that have played around with them.

In a sense, humans have been upgraded by the existence of LLMs. If we had no idea that they existed, we would be fooled by them. But, now that we’ve seen them, we’ve collectively learned how they come up short.

Gaming ChatBot Recommendations

I just had conversation with ChatGPT that ended with it making product recommendations. I was asking about documentation, so it suggested Confluence and Notion.

Those are good options and do match what I was asking about. But, if ChatGPT is going to make product recommendations, then companies will try to game this like they do for search engines.

Before Google, search engines ranked pages on the count of keywords in the document. That was easy to game by just stuffing the keywords. Google solved this by ranking the search results by backlinks. They reasoned that pages with the most incoming links must be better because a backlink was like a vote of confidence. Unfortunately, that was also easy to game. The last 20+ years has been a struggle between Google trying to rank results correctly and SEO experts trying to rank higher than they deserve. There is an enormous amount of money fueling both sides.

We are just starting with chatbots. I don’t know why it specifically chose Confluence and Notion, but since it’s a probability engine, more text in the training set would be a factor. It understands sentiment, so it can discern good reviews from bad reviews and knows that recommendations should offer good options.

But does it understand authority, bias, expertise, etc. If I stuff the internet with 1000’s of good reviews of my product (even if they would never be read by a human), would ChatGPT then start recommending me? The ChatBot equivalents of SEO experts will be trying to figure this out.

Since ChatGPT is currently stuck in January 2022 (about a year ago), you might not be able influence its suggestions for a year or so. The learning cycle is currently very long. But, that also means that it probably can’t recover from being gamed quickly either. Hopefully, models can be developed that resist gaming.

Don’t Make a Phone-a-Taxi App

In 2007, seeing the iPhone and owning a taxi company, you might think it was a good idea to make an app. After all, people need taxis when they are out and about. Maybe, you think, that people will stop using pay phones, so they won’t see the stickers you plastered there with your phone number.

So, you get Phone-a-Taxi made so that people can use an app to call you up. They tell you where they are and you send a cab. There are hardly any apps in the store, so you get a bunch of downloads and calls, and things look great. You are riding the smart phone hype wave.

And then Uber comes along.

That’s what a lot of LLM features feel like to me. Technically, they are using LLM technology, but mostly just trying to shoehorn chats and autocomplete into apps that were not designed around LLMs. Even GitHub Copilot is guilty of this. I love it, but in the end I still have the same amount of code I had before. I don’t even want code—I just want the program.

Like the Phone-a-Taxi app, they are still doing things the old way.

Hype and Career Bets

My professional career started in 1992. Since then, there have been a lot of hyped technologies, but I only really have acted on very few of them. Most of them, I let go by without taking much action.

I only took big career bets on the web and smart phones. Before I learned how to program for them, I used them every day. Then, I taught myself how to develop for them and changed jobs to do it full-time. In both cases, going from a place where I was very comfortable to a startup.

I passed on nearly everything else: social media, web3, blockchain, big data, ML, functional programming, NoSQL, cloud infrastructure, VR/AR—I know them “enough”, but they were not big factors in my career. Partly it was because I was still benefitting from being an expert, but mostly because I wasn’t personally excited by them as a user or as a software product maker.

I’m thinking about this, because we’re in an LLM hype cycle now. I use products based on it nearly every day. I feel like I did when I used the web and smart phones. I’m getting a lot of value and I want to know how to make things for it.

Prompt Engineering is a Dead End

LLM chatbots are bad at some things. Some of this is intentional. For example, we don’t want chatbots to generate hate speech. But, some things are definitely not intentional, like when they make stuff up. Chatbots also fail at writing non-generic text. It’s amazing that they can write coherent text at all, but they can’t compete with good writers.

To get around some of these limitations, we have invented a field called “prompt engineering”, which use convoluted requests to get the chatbot to do something it doesn’t do well (by design or not). For example, LLM hackers have created DAN prompts that jailbreak the AI out of its own safety net. We have also seen the leaked prompts that the AI companies use to set up the safety net in the first place. Outside of safety features, prompt engineers have also found clever ways of trying to get the LLM to question its own fact assertions to make it less likely that it will hallucinate.

Based on the success of these prompts, it looks like a new field is emerging. We’re starting to see job openings for prompt engineers. YouTube keeps recommending that I watch prompt hacking videos. Despite that, I don’t think that this will actually be a thing.

All of the incentives are there for chatbot makers to just make chatbots better with simple prompts. If we think chatbots are going to approach human-level intelligence, then we’ll need prompt engineers as much as we need them now for humans, which is “not at all.”

Prompt engineering is not only a dead end, it’s a security hole.

OWASP Should Include LLM Prompt Hacks in Injection

Yesterday, I wrote that LLM prompt hacking was like an injection attack. I looked up injection in OWASP’s 2021 10 top of security vulnerabilties and see that it’s number three. Since LLM prominence started this year, they haven’t listed prompt hacking yet, but you can see from their description and list of remedies how similar it is to injection. And since we’re busily attaching LLMs to web applications via their APIs, prompt hacking should be considered a web application security vulnerability in the next survey.

Here’s the top prevention technique:

Preventing injection requires keeping data separate from commands and queries:

The preferred option is to use a safe API, which avoids using the interpreter entirely, provides a parameterized interface, or migrates to Object Relational Mapping Tools (ORMs).
Note: Even when parameterized, stored procedures can still introduce SQL injection if PL/SQL or T-SQL concatenates queries and data or executes hostile data with EXECUTE IMMEDIATE or exec().

For an LLM this means that the LLM itself isn’t affected by the user query. I realize that that may be impossible with current implementations. My suggestion is to somehow create two channels (one for “code” and one for “data”) in the training process so that the resulting model isn’t exploitable this way.

No, I have no idea how to do that, but it’s not with a more convoluted prompt.