Category Archives: AI

Prompt Engineering is a Dead End

LLM chatbots are bad at some things. Some of this in intentional. For example, we don’t want chatbots to generate hate speech. But, some things are definitely not intentional, like when they make stuff up. Chatbots also fail at writing non-generic text. It’s amazing that they can write coherent text at all, but they can’t compete with good writers.

To get around some of these limitations, we have invented a field called “prompt engineering”, which use convoluted requests to get the chatbot to do something it doesn’t do well (by design or not). For example, LLM hackers have created DAN prompts that jailbreak the AI out of its own safety net. We have also seen the leaked prompts that the AI companies use to set up the safety net in the first place. Outside of safety features, prompt engineers have also found clever ways of trying to get the LLM to question its own fact assertions to make it less likely that it will hallucinate.

Based on the success of these prompts, it looks like a new field is emerging. We’re starting to see job openings for prompt engineers. YouTube keeps recommending that I watch prompt hacking videos. Despite that, I don’t think that this will actually be a thing.

All of the incentives are there for chatbot makers to just make chatbots better with simple prompts. If we think chatbots are going to approach human-level intelligence, then we’ll need prompt engineers as much as we need them now for humans, which is “not at all.”

Prompt engineering is not only a dead end, it’s a security hole.

OWASP Should Include LLM Prompt Hacks in Injection

Yesterday, I wrote that LLM prompt hacking was like an injection attack. I looked up injection in OWASP’s 2021 10 top of security vulnerabilties and see that it’s number three. Since LLM prominence started this year, they haven’t listed prompt hacking yet, but you can see from their description and list of remedies how similar it is to injection. And since we’re busily attaching LLMs to web applications via their APIs, prompt hacking should be considered a web application security vulnerability in the next survey.

Here’s the top prevention technique:

Preventing injection requires keeping data separate from commands and queries:

  • The preferred option is to use a safe API, which avoids using the interpreter entirely, provides a parameterized interface, or migrates to Object Relational Mapping Tools (ORMs).
    Note: Even when parameterized, stored procedures can still introduce SQL injection if PL/SQL or T-SQL concatenates queries and data or executes hostile data with EXECUTE IMMEDIATE or exec().

For an LLM this means that the LLM itself isn’t affected by the user query. I realize that that may be impossible with current implementations. My suggestion is to somehow create two channels (one for “code” and one for “data”) in the training process so that the resulting model isn’t exploitable this way.

No, I have no idea how to do that, but it’s not with a more convoluted prompt.

We Keep Reinventing Injection Attacks

Web programmers can cause security problems if they embed data into HTML and render the result. For example, if I have a simple form that asks for your name and then output a page with that name in it, I’ll open myself up to an “injection” attack if the user types in some Javascript, and I don’t carefully escape it. I’ll end up running that Javascript.

The same is true if we take user data and try to create queries by concatenating it with SQL, as lampooned by XKCD.

We invented encoding and string interpolation techniques to solve this. But nothing forces you to use those features, so we still mess it up, which is why security bounties are frequently paid for injection attacks.

But, those issues are with legacy languages like HTML and SQL where we send strings that mix code and data over the network and run them. We should have designed them in a way that separated the code and the data. Surely, we learned from that for new things that we invented since then.

We did not.

An LLM chatbot is also a service that we send strings over a network to. The prompt you send is “code” in natural language and the LLM “runs it”. The problem is that there is a kind of meta-language that controls the chatbot itself, which can be sent before your normal prompts. Using these “jailbreaking” prompts, you can trick the LLM into dropping its safety net and produce hate speech or help you code malware.

These prompts are essentially the same idea that Bobby’s mom is using in the comic, and the solution is likely going to be a prompt version of what encoding and string interpolation is doing.

It would be better if the system was designed such that user chat requests weren’t treated like a program that could change the chatbot itself.

Be Happy When It’s Hard. Be Worried When It’s Easy.

When I was running product development for Atalasoft, I used to love it when a developer was having a hard time with a project. We sold image codecs, and our main competitor was open-source. Our secret sauce was that we could do things they wouldn’t. It was supposed to be hard.

If it were easy, everyone would do it, and it would be free. We wouldn’t have a business.

I think about this a lot when I see what people are doing with Large Language Models. Making an LLM isn’t easy, but Google thinks there’s no moat here for anyone. Still, it’s hard enough because it costs a lot of money, even if you know exactly how to do it.

The part that’s more concerning is what people who use LLM’s are saying. Everyone is so surprised how well it does with basically no work or skill on the part of the user. That’s fine, but then anyone could do it, and the thing they are doing with LLMs isn’t going to accrue value to them.

I think every knowledge worker should be using LLMs in some way, if only to learn about it. It offers enough benefits right now that it can’t be ignored. But, the easier it seems to be that you are getting good results, the more I would be concerned that you won’t be necessary to get those results in the future.

Observations on the MIT Study on GitHub Copilot

I just saw this study on GitHub Copilot from February. Here is the abstract:

Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogenous effects show promise for AI pair programmers to help people transition into software development careers.

The researchers report benefits to less experienced developers, which is at odds with this other study I wrote about and my own intuition. However, all of the developers were experienced Javascript developers, and not literally learning programming, which is where I think the more detrimental effect would be.

Approaching Infinity

Moore’s Law predicts that the number of transistors on a chip doubles every eighteen months. But, it has always been understood to be a statement about system capability as well. Speed, memory—we’re even getting advancements in power consumption now with Apple Silicon.

The doubling results in an exponential curve, but at the start, doubling a tiny number doesn’t get you much. My first computer had 4k of memory, but it was already an old model when I got it. By the next year, I had a Commodore 64 with 64k, then a Commodore 128(k) a few years later. My C64 was 1MHz in 1984. In 1992, my first work computer was a 16MHz 386 with 1MB of memory. Nice growth, but from a very low base, so still very underpowered in absolute numbers.

But, just like in personal finance, compounding eventually has enormous impact. It’s not just speed and power. We’re feeling it across all industries. Ubiquitous Software Copilots, Vision Pro, new vaccines, technology-enabled sports analytics, pervasive remote-work—all enabled by the last few doublings.

A doubling means that you have the equivalent impact of the entire industry back to the UNIVAC compressed into eighteen months. And the next 18 months doubles that.

I know this is nothing new. Ray Kurzweil’s Singularity described this in 2005. I’m more pointing out that here we are, and it seems like an inflection is happening where we’re doubling big numbers.

In my 30+ year career as a developer, I experienced a steady stream of big industry shifts. In the 90’s, it was web, then, in the 2000’s, it was web 2.0 and the advent of smart phones. The 2010’s were driven by XaaS (platform, infrastructure, etc) technologies. I could learn these as they happened. There wasn’t instantaneous adoption—you could keep up.

Now these waves are coming very fast, and I wonder if this is what it feels like when you start to approach infinity.

LUIs Give You User Intent

Language User Interfaces (LUIs) are driven by natural language prompts which an LLM can use to drive your command-based application. Even if the LUI makes mistakes, the prompts are a treasure trove of user intent.

Right now, we broadly have two ways to get user data: Analytics and User Research. Analytics are easy to scale and are useful, but they cannot give you user intent. They can tell you what the user did, but not why. User research is targeted right at uncovering causal and intent data, but it’s hard to scale.

A LUI gives you the best of both worlds because it asks the user to express what they want in their own words and can easily be deployed to all users.

As an example, consider a dashboard configuration GUI for a B2B SaaS app. Almost every enterprise application has something like this—in this case, let’s consider Salesforce.

Using a GUI, a user might tap on “New Dashboard” and then “Add bar chart” and then use some filters to set it up. And then, they “Add pie chart” and set that up. They put in another chart, then quickly delete it. They add, delete, reorder, and configure for an hour until they seem to be satisfied. In an analytics dataset, you’d have rows for all of these actions. You would have no idea what the user was trying to do.

In a LUI, the user might start with “I have a 1:1 with my manager on Thursday. What are some of the things I excel at that would be good to highlight”. “Ok, make a dashboard showing my demo-to-close ratio and my pipeline velocity”. “Add in standard personal sales data that a sales manager would expect”.

This is something you could find out in user research, but it’s quite expensive to get that data. Some kind of LUI, even if it wasn’t great, would start to help you collect that data at scale.

You might found out a new Job to be Done (1:1 meetings with sales managers) that you could directly support.

ChatGPT Can Add a LUI to a Terminal Command based UI

Before GUIs took off, there were more command-driven applications. The program would respond with text answers like a specialized chat. As I said yesterday, this is not a Language User Interface (LUI), which would use natural language, not specialized commands.

One benefit that command driven systems have over modern GUI systems is that ChatGPT could probably drive them. Large language models seem to have no problem learning programming languages, even niche ones. You can even teach them new ones in your prompt.

To take advantage of this, we should be adding very well specified mini-languages to our applications to help our users get help from ChatGPT. Here’s a simple example based on a fictional airline flights query language I made up:

In your application, you would offer a command window that already has primed ChatGPT with the specification of the language and many examples. I barely had to do anything to get it learn this simple command. I went on in the chat to ask for more complicated queries, even a series of queries to find out about connecting flights, and it had no problems.

In your application, you only need to parse the terminal like commands, which is a lot easier than implementing a natural language parser, even for a constrained topic like airline booking.

I’m sure ChatGPT could build the command parser for you too if you wanted.

LUI LUI

I go by Lou, but my entire family calls me Louie, so I smiled when I found out that there is such a thing called a Language User Interface that uses natural language to drive an application and that it was called a LUI.

In a LUI, you use natural language. So this is not the same as a keyword search or a terminal style UI that uses simple commands like the SABRE airline booking system.

In this video, it output responses on a printer. But the display terminal version was not that different. I worked on software that interfaced with it in 1992, and this 1960’s version is very recognizable to me.

But, this is not a LUI. A LUI does not make you remember a list of accepted commands and their parameters. You give it requests in just the way you would a person, with regular language.

In SABRE, a command might look like this:

    113JUNORDLGA5P

But, in a SABRE LUI, you’d say “What flights are leaving Chicago O’Hare for Laguardia at 5pm today?” which may be more learnable, but a trained airline representative would be a lot faster with the arcane commands.

With a more advanced version that understood “Rebook Lou Franco from his flight from here to New Orleans to NYC instead” that uses many underlying queries and commands (and understands context), the LUI would also be a lot faster.

This would have seemed far-fetched, but with ChatGPT and other LLM systems, it feels very much within reach today.

Morning Pages Make Me Feel Like ChatGPT

In the first episode of my podcast I said that I do morning pages to train myself to write on demand, and then I followed that up in Episode 3 where I explained that I use the momentum from morning pages to write a first draft of something.

While doing my morning pages last week I thought about how doing them is kind of like how ChatGPT generates text. It’s just statistically picking the next word based on all the words so far in the prompt and what it has already generated.

I am also doing something like that In my morning pages. I am writing writing writing and I use the words so far to guide my next ones.

My mind strays as I write and a phrase might trigger a new thread, which I follow for a bit and then follow another and another. ChatGPT’s results are a lot more coherent than my morning pages. It has an uncanny ability to stay on topic because it is considering all of the text, and I don’t.

First drafts are different. When I switch to writing a first draft, I do consider the entire text. I’m not as fast, because I am constantly looking at where I am so far. I also start with a a prompt in the form of a simple message that I hope to convey, which I use as the working title.

I know I could get a first draft faster from ChatGPT, but it would not be as good (I think), or at least not specific to me. More importantly, I would not have improved as a writer.

[NOTE: While writing a draft of this post, I thought of a way to make my morning pages more directed and made a podcast about it]