Category Archives: AI

Gaming ChatBot Recommendations

I just had conversation with ChatGPT that ended with it making product recommendations. I was asking about documentation, so it suggested Confluence and Notion.

Those are good options and do match what I was asking about. But, if ChatGPT is going to make product recommendations, then companies will try to game this like they do for search engines.

Before Google, search engines ranked pages on the count of keywords in the document. That was easy to game by just stuffing the keywords. Google solved this by ranking the search results by backlinks. They reasoned that pages with the most incoming links must be better because a backlink was like a vote of confidence. Unfortunately, that was also easy to game. The last 20+ years has been a struggle between Google trying to rank results correctly and SEO experts trying to rank higher than they deserve. There is an enormous amount of money fueling both sides.

We are just starting with chatbots. I don’t know why it specifically chose Confluence and Notion, but since it’s a probability engine, more text in the training set would be a factor. It understands sentiment, so it can discern good reviews from bad reviews and knows that recommendations should offer good options.

But does it understand authority, bias, expertise, etc. If I stuff the internet with 1000’s of good reviews of my product (even if they would never be read by a human), would ChatGPT then start recommending me? The ChatBot equivalents of SEO experts will be trying to figure this out.

Since ChatGPT is currently stuck in January 2022 (about a year ago), you might not be able influence its suggestions for a year or so. The learning cycle is currently very long. But, that also means that it probably can’t recover from being gamed quickly either. Hopefully, models can be developed that resist gaming.

Don’t Make a Phone-a-Taxi App

In 2007, seeing the iPhone and owning a taxi company, you might think it was a good idea to make an app. After all, people need taxis when they are out and about. Maybe, you think, that people will stop using pay phones, so they won’t see the stickers you plastered there with your phone number.

So, you get Phone-a-Taxi made so that people can use an app to call you up. They tell you where they are and you send a cab. There are hardly any apps in the store, so you get a bunch of downloads and calls, and things look great. You are riding the smart phone hype wave.

And then Uber comes along.

That’s what a lot of LLM features feel like to me. Technically, they are using LLM technology, but mostly just trying to shoehorn chats and autocomplete into apps that were not designed around LLMs. Even GitHub Copilot is guilty of this. I love it, but in the end I still have the same amount of code I had before. I don’t even want code—I just want the program.

Like the Phone-a-Taxi app, they are still doing things the old way.

Hype and Career Bets

My professional career started in 1992. Since then, there have been a lot of hyped technologies, but I only really have acted on very few of them. Most of them, I let go by without taking much action.

I only took big career bets on the web and smart phones. Before I learned how to program for them, I used them every day. Then, I taught myself how to develop for them and changed jobs to do it full-time. In both cases, going from a place where I was very comfortable to a startup.

I passed on nearly everything else: social media, web3, blockchain, big data, ML, functional programming, NoSQL, cloud infrastructure, VR/AR—I know them “enough”, but they were not big factors in my career. Partly it was because I was still benefitting from being an expert, but mostly because I wasn’t personally excited by them as a user or as a software product maker.

I’m thinking about this, because we’re in an LLM hype cycle now. I use products based on it nearly every day. I feel like I did when I used the web and smart phones. I’m getting a lot of value and I want to know how to make things for it.

Prompt Engineering is a Dead End

LLM chatbots are bad at some things. Some of this is intentional. For example, we don’t want chatbots to generate hate speech. But, some things are definitely not intentional, like when they make stuff up. Chatbots also fail at writing non-generic text. It’s amazing that they can write coherent text at all, but they can’t compete with good writers.

To get around some of these limitations, we have invented a field called “prompt engineering”, which use convoluted requests to get the chatbot to do something it doesn’t do well (by design or not). For example, LLM hackers have created DAN prompts that jailbreak the AI out of its own safety net. We have also seen the leaked prompts that the AI companies use to set up the safety net in the first place. Outside of safety features, prompt engineers have also found clever ways of trying to get the LLM to question its own fact assertions to make it less likely that it will hallucinate.

Based on the success of these prompts, it looks like a new field is emerging. We’re starting to see job openings for prompt engineers. YouTube keeps recommending that I watch prompt hacking videos. Despite that, I don’t think that this will actually be a thing.

All of the incentives are there for chatbot makers to just make chatbots better with simple prompts. If we think chatbots are going to approach human-level intelligence, then we’ll need prompt engineers as much as we need them now for humans, which is “not at all.”

Prompt engineering is not only a dead end, it’s a security hole.

OWASP Should Include LLM Prompt Hacks in Injection

Yesterday, I wrote that LLM prompt hacking was like an injection attack. I looked up injection in OWASP’s 2021 10 top of security vulnerabilties and see that it’s number three. Since LLM prominence started this year, they haven’t listed prompt hacking yet, but you can see from their description and list of remedies how similar it is to injection. And since we’re busily attaching LLMs to web applications via their APIs, prompt hacking should be considered a web application security vulnerability in the next survey.

Here’s the top prevention technique:

Preventing injection requires keeping data separate from commands and queries:

  • The preferred option is to use a safe API, which avoids using the interpreter entirely, provides a parameterized interface, or migrates to Object Relational Mapping Tools (ORMs).
    Note: Even when parameterized, stored procedures can still introduce SQL injection if PL/SQL or T-SQL concatenates queries and data or executes hostile data with EXECUTE IMMEDIATE or exec().

For an LLM this means that the LLM itself isn’t affected by the user query. I realize that that may be impossible with current implementations. My suggestion is to somehow create two channels (one for “code” and one for “data”) in the training process so that the resulting model isn’t exploitable this way.

No, I have no idea how to do that, but it’s not with a more convoluted prompt.

We Keep Reinventing Injection Attacks

Web programmers can cause security problems if they embed data into HTML and render the result. For example, if I have a simple form that asks for your name and then output a page with that name in it, I’ll open myself up to an “injection” attack if the user types in some Javascript, and I don’t carefully escape it. I’ll end up running that Javascript.

The same is true if we take user data and try to create queries by concatenating it with SQL, as lampooned by XKCD.

We invented encoding and string interpolation techniques to solve this. But nothing forces you to use those features, so we still mess it up, which is why security bounties are frequently paid for injection attacks.

But, those issues are with legacy languages like HTML and SQL where we send strings that mix code and data over the network and run them. We should have designed them in a way that separated the code and the data. Surely, we learned from that for new things that we invented since then.

We did not.

An LLM chatbot is also a service that we send strings over a network to. The prompt you send is “code” in natural language and the LLM “runs it”. The problem is that there is a kind of meta-language that controls the chatbot itself, which can be sent before your normal prompts. Using these “jailbreaking” prompts, you can trick the LLM into dropping its safety net and produce hate speech or help you code malware.

These prompts are essentially the same idea that Bobby’s mom is using in the comic, and the solution is likely going to be a prompt version of what encoding and string interpolation is doing.

It would be better if the system was designed such that user chat requests weren’t treated like a program that could change the chatbot itself.

Be Happy When It’s Hard. Be Worried When It’s Easy.

When I was running product development for Atalasoft, I used to love it when a developer was having a hard time with a project. We sold image codecs, and our main competitor was open-source. Our secret sauce was that we could do things they wouldn’t. It was supposed to be hard.

If it were easy, everyone would do it, and it would be free. We wouldn’t have a business.

I think about this a lot when I see what people are doing with Large Language Models. Making an LLM isn’t easy, but Google thinks there’s no moat here for anyone. Still, it’s hard enough because it costs a lot of money, even if you know exactly how to do it.

The part that’s more concerning is what people who use LLM’s are saying. Everyone is so surprised how well it does with basically no work or skill on the part of the user. That’s fine, but then anyone could do it, and the thing they are doing with LLMs isn’t going to accrue value to them.

I think every knowledge worker should be using LLMs in some way, if only to learn about it. It offers enough benefits right now that it can’t be ignored. But, the easier it seems to be that you are getting good results, the more I would be concerned that you won’t be necessary to get those results in the future.

Observations on the MIT Study on GitHub Copilot

I just saw this study on GitHub Copilot from February. Here is the abstract:

Generative AI tools hold promise to increase human productivity. This paper presents results from a controlled experiment with GitHub Copilot, an AI pair programmer. Recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to the AI pair programmer, completed the task 55.8% faster than the control group. Observed heterogenous effects show promise for AI pair programmers to help people transition into software development careers.

The researchers report benefits to less experienced developers, which is at odds with this other study I wrote about and my own intuition. However, all of the developers were experienced Javascript developers, and not literally learning programming, which is where I think the more detrimental effect would be.

Approaching Infinity

Moore’s Law predicts that the number of transistors on a chip doubles every eighteen months. But, it has always been understood to be a statement about system capability as well. Speed, memory—we’re even getting advancements in power consumption now with Apple Silicon.

The doubling results in an exponential curve, but at the start, doubling a tiny number doesn’t get you much. My first computer had 4k of memory, but it was already an old model when I got it. By the next year, I had a Commodore 64 with 64k, then a Commodore 128(k) a few years later. My C64 was 1MHz in 1984. In 1992, my first work computer was a 16MHz 386 with 1MB of memory. Nice growth, but from a very low base, so still very underpowered in absolute numbers.

But, just like in personal finance, compounding eventually has enormous impact. It’s not just speed and power. We’re feeling it across all industries. Ubiquitous Software Copilots, Vision Pro, new vaccines, technology-enabled sports analytics, pervasive remote-work—all enabled by the last few doublings.

A doubling means that you have the equivalent impact of the entire industry back to the UNIVAC compressed into eighteen months. And the next 18 months doubles that.

I know this is nothing new. Ray Kurzweil’s Singularity described this in 2005. I’m more pointing out that here we are, and it seems like an inflection is happening where we’re doubling big numbers.

In my 30+ year career as a developer, I experienced a steady stream of big industry shifts. In the 90’s, it was web, then, in the 2000’s, it was web 2.0 and the advent of smart phones. The 2010’s were driven by XaaS (platform, infrastructure, etc) technologies. I could learn these as they happened. There wasn’t instantaneous adoption—you could keep up.

Now these waves are coming very fast, and I wonder if this is what it feels like when you start to approach infinity.

LUIs Give You User Intent

Language User Interfaces (LUIs) are driven by natural language prompts which an LLM can use to drive your command-based application. Even if the LUI makes mistakes, the prompts are a treasure trove of user intent.

Right now, we broadly have two ways to get user data: Analytics and User Research. Analytics are easy to scale and are useful, but they cannot give you user intent. They can tell you what the user did, but not why. User research is targeted right at uncovering causal and intent data, but it’s hard to scale.

A LUI gives you the best of both worlds because it asks the user to express what they want in their own words and can easily be deployed to all users.

As an example, consider a dashboard configuration GUI for a B2B SaaS app. Almost every enterprise application has something like this—in this case, let’s consider Salesforce.

Using a GUI, a user might tap on “New Dashboard” and then “Add bar chart” and then use some filters to set it up. And then, they “Add pie chart” and set that up. They put in another chart, then quickly delete it. They add, delete, reorder, and configure for an hour until they seem to be satisfied. In an analytics dataset, you’d have rows for all of these actions. You would have no idea what the user was trying to do.

In a LUI, the user might start with “I have a 1:1 with my manager on Thursday. What are some of the things I excel at that would be good to highlight”. “Ok, make a dashboard showing my demo-to-close ratio and my pipeline velocity”. “Add in standard personal sales data that a sales manager would expect”.

This is something you could find out in user research, but it’s quite expensive to get that data. Some kind of LUI, even if it wasn’t great, would start to help you collect that data at scale.

You might found out a new Job to be Done (1:1 meetings with sales managers) that you could directly support.