ACT-1: How Adept Is Building the Future of AI with Action Transformers
The future of AI is digital and physical agents that can act in the world guided by human commands.
One of AI’s most ambitious goals is to build systems that can do everything a human can. GPT-3 can write and Stable Diffusion can paint, but neither can interact with the world directly. AI companies have been trying to create intelligent agents that can do so for 10 years. This seems to be changing now.
One of my latest articles covers Google’s PaLM-SayCan (PSC), a robot powered by PaLM, the best large language model to date. PSC’s language module can interpret human requests expressed in natural language and transform them into high-level tasks that can be further broken down into elemental actions. The robotic module can then perform those actions in the real world to fulfill the user’s request. Although it has important limitations (that I reviewed), Google’s PSC is one of the first examples of the integration of latest-generation AI with robotics: the digital connected to the physical.
But physical robots aren’t the only AI agents that can affect the world directly. Another promising line of research is that of digital agents that can interact with software through open-ended action. I’m not talking here about GPT-3 or DALL·E, which have an extremely constrained action space (they lack any type of motor system, physical or digital) and thus can only affect the world indirectly (after a human reads or sees their generations). I’m referring, for instance, to OpenAI’s Video PreTraining model (VPT, which I’ve covered before), which learned to play Minecraft by watching humans play, mimicking their behavior to some extent.
So far, there aren’t many systems like VPT because it’s a very new technology. It’s hard to train and build these systems: Whereas GPT-3 can only modify the world’s information, VPT can also modify the world’s state—even if at the digital level. VPT enjoys a higher degree of agency and represents a step closer to general intelligence.
But today I won’t be talking about Google’s PSC or OpenAI’s VPT.
Adept builds the first Action Transformer (ACT-1)
Today’s article features a new AI startup, Adept, which was launched earlier this year to build “useful general intelligence.” Adept has brought together great talent from Google, DeepMind, and OpenAI, including some of the researchers who invented the transformer in 2017. Like most other AI companies, their main goal is to build intelligent agents, but their vision is rather unique.
Instead of building an AGI that could do everything we can (which would eventually lead to mass replacement of the human workforce), they want to build an intelligent interface that acts as a natural translator between humans and the digital world: AI and humans working in collaboration instead of competition. As David Luan, a researcher now at Adept and previously at Google and OpenAI, says, “we want to build a natural language interface… — an NL frontend to your computer.”
Adept has now announced its first AI model: The Action Transformer (ACT-1). They haven’t disclosed any technical details except that it’s “a large-scale transformer trained to use digital tools.” (They will publish a paper shortly. I’ll update this article to include new info.) Adept’s ACT-1 is a digital agent intended to communicate with other programs and apps and serve as an interface between us and the digital world, a natural human-computer interface (HCI).
Adept has released a few short demo videos in which you can see ACT-1 in action. It can take high-level requests expressed in natural language and carry them out, much like Google’s PSC. The tasks vary in complexity and can span several steps across software tools and websites. ACT-1 can use different tools at different points of the process and can take in user feedback to improve.
Most importantly, ACT-1 can perform actions that we wouldn’t know how to do. This is where ACT-1’s usefulness becomes apparent. ACT-1 can act as a multitasking meta-learner capable of handling all kinds of software apps. To make it work, we’d only have to know the outcome we want and how to communicate it to ACT-1. If ACT-1 worked perfectly, we wouldn’t have to learn to use Excel, Photoshop, or Salesforce. We’d simply delegate the work to ACT-1 and focus on more cognitively challenging problems.
If you re-read the last paragraph you can glimpse the two key aspects of ACT-1 (and digital agents in general) to which I’ll devote the rest of the article. First, we have a very big “if” in “if ACT-1 worked perfectly.” If it didn’t, how could we know when to trust it when we’re faced with a task for which we lack the ability or knowledge (e.g., you have to organize some data but don’t know how to use Excel)? Second, using ACT-1 correctly requires that we know how to communicate with it — which reinforces just how important prompting will be in the future (if it wasn’t already clear from GPT-3 or Stable Diffusion).
Let’s go with the first point.
Big promises on the other side of daunting challenges
One of the main limitations of transformer-based models like GPT-3 is that they’re too unreliable to be used in high-stakes settings (e.g., mental health therapy). The reason is that these models (GPT-3 but also VPT or ACT-1) are trained on internet data and optimized to predict the next token/action given a history of previous ones. This means they lack common sense, the ability to express intent, or a deep understanding of how the world works. These AI systems are inherently limited in what they can do and how well they work. Some limitations could be a matter of scale (more data and more parameters would solve them), but others seem to be intrinsic to how they’re designed and developed.
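To make the next-token objective described above more concrete, here is a minimal sketch that predicts the next token from simple bigram counts. It is only an illustration: GPT-3 and ACT-1 are large transformers, not count tables, and the function names and the tiny corpus are invented for this example. What it does share with those models is the core idea that the system picks what usually comes next in its training data, with no notion of what the user intends.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training data."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy "training data": a sequence of UI-like actions as words.
corpus = "open the file then save the file then close the app".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))          # the most common follower of "the"
print(predict_next(model, "spreadsheet"))  # unseen token, so no prediction
```

The sketch can only echo patterns it has counted. Anything outside its data, such as the unseen word in the last line, leaves it with nothing to say.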
ACT-1, although intended for a different purpose than GPT-3 and despite its larger action space, falls under the same category of models and is therefore limited for the same reasons. Its ability to interpret human requests doesn’t include the ability to understand intention. In one of the demo examples, the user asks ACT-1 to find a house in Houston for less than 600K. ACT-1 goes and finds a house that meets the criteria, but, unlike a human, who would instantly infer it, ACT-1 doesn’t know that the user wants the house for a particular purpose. There’s some intention behind the request and some real-world context needed to make the right decisions. ACT-1 can’t access that information.
Now, let’s return to that big “if” I mentioned above. In the house-search case, it’s expected that the user knows how to do it without ACT-1. But what if the user makes a request they wouldn’t know how to complete otherwise? One reasonable possibility is that the user would blindly trust the AI system to do the right thing (we, foolishly, do this all the time). That’s the perfect recipe to develop an unhealthy dependence on AI we can’t trust.
You could argue that we’re already dependent on many, many layers of abstraction that, if broken apart, would leave us (and everyone else) completely defenseless. That’s true, but we tend to build those layers with sufficient trustworthiness. For example, planes seem to work just fine, although you most likely don’t have a clue how they work or whether they’re reliable; you trust that society has an incentive to build them so they don’t crash. In contrast, deep learning-based systems, even state-of-the-art ones, aren’t built with this kind of reliability.
The other possibility is that the user, aware of how AI works and knowledgeable about its risks, decides not to trust the system blindly. But even in this case, they wouldn’t have the means to assess whether ACT-1 did the right thing. The same happens when we ask GPT-3 something we don’t know the answer to. In some cases, we could simply check the result ourselves (which defeats the purpose of the system), but in other cases, we couldn’t.
If the user’s distrust is strong enough, it could lead them to not use the system. But then another question arises: What if society starts to rely heavily on these types of natural language digital interfaces (as it did with social media and smartphones)? Big trouble.
Until we can build AI we can trust (as professor Gary Marcus would say), the promises of systems like ACT-1 are just that, promises. If it can’t work reliably, ACT-1 is just a very expensive tool that can only do tasks we can also do — and oftentimes we’d have to go and redo what it did wrong.
ACT-1’s ultimate purpose is ambitious but there are important challenges ahead before companies like Adept (or OpenAI or DeepMind) can get there.
Prompting is the future of human-computer interaction
Now, to the second point: the importance of prompting (I’ve previously written about this in “Software 3.0 — How Prompting Will Change the Rules of the Game”).
You’re likely quite familiar already with the concept of prompting (state-of-the-art generative models like GPT-3, LaMDA, DALL·E, Stable Diffusion, etc. all work with prompts). In case you aren’t, prompting is simply a way to communicate with AI systems (and, more generally, with computers) using human natural language (English, for instance), to make them do some specific action or task.
Prompting is how we make generative AI models do what we want. If you want GPT-3 to write an essay you can say: “Write a 5-paragraph essay on the risks of AI.” That’s a prompt. If you want DALL·E to create a beautiful image you can say: “A cat and a dog playing with a ball on a summer day, in a vibrant and colorful style, HD.” That’s another prompt. Google’s PSC and Adept’s ACT-1 work the same way.
Prompting contrasts with programming languages in that it’s highly intuitive for us. Programming languages like Python or C are among the most common ways to instruct computers today. Computers can translate these languages into machine instructions without ambiguity, but we have a harder time learning them (they can require years of practice to master). Because prompting is nothing more than natural language, we can learn it right away.
Some have drawn analogies between prompting and no-code tools, but there’s an important difference. Although no-code software removes the need to learn to code, it still requires users to learn each specific tool separately; no-code tools aren’t meta-tools. To make ACT-1, a meta-tool, do something, you only need one skill: prompting. It’s a form of no-code, true, but it’s also a transversal, natural skill, the ultimate dream of non-tech-savvy people.
To put prompting in historical context, we can see it as the latest step in a long history of HCIs (punch cards, machine code, assembly, low-level programming languages, and high-level programming languages). We’ve been climbing the stairs from talking in machine language to talking in human language. We’ve built increasingly abstract layers on top of the previous ones with the goal of hiding the complexity behind human-computer communication to make it easier for us.
Prompting is the latest, simplest, and most intuitive way to make a computer do something. It’s the most powerful communicative HCI because it lets us stay on our own territory. It reduces the barriers for digital users to the bare minimum. For these reasons, I think prompting HCIs will be as ubiquitous in a few years as smartphones are today: a tool we’ll use daily, for anything that has to do with the digital world.
Prompting is intuitive, but a skill nevertheless
But even if prompting is the most natural way to communicate with computers yet, it’s not an innate ability. It’s a skill that requires practice to master (even if the amount it requires isn’t comparable to learning to program). You can think of it as a new mode of discourse, equivalent to how we modify our tone, style, and vocabulary when talking to a kid, or how politicians use rhetoric when talking to us. Prompting is natural language communication directed toward a particular target, in a particular form. It’s in that sense that it takes time to master.
Tech blogger Gwern suggests framing prompting as a new programming paradigm. This definition can alienate non-coding people, but it helps them understand that it’s not innate. It’s a skill that, although highly intuitive, requires practice, too (e.g., making GPT-3 output what you want may take several attempts before you get something decent).
As Gwern explains, if we imagine prompting as a programming language, each prompt can be understood as a program. When you input a request in English to GPT-3, you are “programming” it to do a task that other versions of it don’t know how to do. You’re creating a slightly different version of it. Thus, prompting isn’t just a way to communicate our wants and needs to computers. It’s also a way to teach them to do new tasks with natural language.
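The idea of a prompt as a program can be sketched in a few lines. The helper below only assembles text: it makes no call to GPT-3 or any other model, and the function name, the "Input:/Output:" layout, and the examples are assumptions made for illustration, not a format prescribed by Gwern or OpenAI. The point is that changing the instruction and the examples yields a different "program" for the same underlying model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Two different "programs" built for the same (hypothetical) model.
translate = build_few_shot_prompt(
    "Translate English to Spanish.",
    [("cat", "gato"), ("house", "casa")],
    "dog",
)
pluralize = build_few_shot_prompt(
    "Give the plural of each English noun.",
    [("cat", "cats"), ("house", "houses")],
    "dog",
)

print(translate)
```

Both prompts end with the same query, "dog", yet they ask for different behavior, which is why prompt wording matters as much as it does.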
Gwern uses GPT-3 as an example to emphasize that prompting is a skill. He says one of the main criticisms GPT-3 received early on was its inability to correctly perform some basic language tasks. He managed to prove the critics wrong by finding better prompts. He showed that not all prompts are equally good or equally valid for achieving some result, in the same way that talking to a fellow human can be seen as a skill and can be performed better or worse (if we take this argument to its extreme, we find that everything is a skill).
Even if AI systems like GPT-3 or ACT-1 prove to be very useful, people will still need to learn to create good prompts (similarly to what we now do with GPT-3 or Stable Diffusion, which are tools not everyone has mastered to the same degree).
Anyway, although prompting isn’t a panacea, it’s definitely a great leap forward in human-computer interaction, and it will democratize the ability to leverage computers, programs, apps, and other tools for people who otherwise wouldn’t be able to.
Ambiguity and context: Prompting’s Achilles’ Heel
However, despite the great advantages of prompting over previous HCIs, it’s not perfect. It has one important shortcoming: the inherent ambiguity of human language combined with the lack of context.
If you think about programming languages (and even no-code tools), there’s no room for interpretation. The syntax is rigid and clear. If you type a statement in Python, it can only mean one thing, and the computer doesn’t need to “reason about” or “understand” its meaning. It can immediately act according to the request. Because prompts live in the domain of natural languages, they lose the rigidity and non-ambiguity of traditional code. This is a critical problem if we want HCIs that work with prompts.
We humans understand each other (although not always) because we have access both to a pool of shared knowledge that we assume is common and to the contextual information that surrounds any given interaction. That’s the pragmatic side of language, which can’t be conveyed in the requests we make to GPT-3 or ACT-1.
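The house-search request from the demo makes the contrast easy to see. The sketch below uses made-up listings and field names (they are assumptions for illustration, not data from Adept’s demo). Written as Python, the request has exactly one meaning; written as English, “a house in Houston for less than 600K” leaves several decisions open.

```python
# Hypothetical listings; the cities and prices are invented for this example.
listings = [
    {"city": "Houston", "price": 550_000},
    {"city": "Houston", "price": 600_000},
    {"city": "Austin", "price": 400_000},
]

# In code, every decision is explicit: strictly below 600,000, Houston only,
# and all matches are returned rather than just one.
matches = [
    house
    for house in listings
    if house["city"] == "Houston" and house["price"] < 600_000
]

print(matches)

# The English request leaves the same decisions unstated:
#   - Is a house priced at exactly 600K acceptable?
#   - Does the user want one house or every match?
#   - Which match should come first: cheapest, newest, closest to work?
```

The code version excludes the house priced at exactly 600,000 because the comparison says so. A prompt-driven agent has to decide that on its own.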
On the one hand, these systems lack common sense and access to our shared knowledge about the world. On the other hand, they lack the specific context of any given interaction, because that context is often intransmissible through explicit means (that is, written or spoken language). This implies that when there’s ambiguity, ACT-1 will have to either take a guess or stop right there and not finish the task.
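The “guess or stop” choice can be sketched with a toy request resolver. Nothing here reflects how ACT-1 works internally, since Adept hasn’t published technical details; the keyword matching, the action names, and the `allow_guess` flag are all assumptions invented for this example. It only shows the shape of the dilemma: when two actions fit the request equally well, the system has to pick one or give up.

```python
def resolve_action(request, known_actions, allow_guess=False):
    """Pick the action whose keywords best overlap the request; on a tie, guess or stop."""
    words = set(request.lower().split())
    scores = {name: len(words & keywords) for name, keywords in known_actions.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return ("stop", [])  # nothing matches at all
    top = sorted(name for name, score in scores.items() if score == best)
    if len(top) == 1:
        return ("act", top[0])  # unambiguous request
    if allow_guess:
        return ("guess", top[0])  # arbitrary choice: first candidate alphabetically
    return ("stop", top)  # ambiguous: report the candidates and do nothing


# Hypothetical spreadsheet actions and the words that trigger them.
ACTIONS = {
    "sort_ascending": {"sort", "ascending"},
    "sort_descending": {"sort", "descending"},
}

print(resolve_action("sort prices ascending", ACTIONS))
print(resolve_action("sort the prices", ACTIONS))
print(resolve_action("sort the prices", ACTIONS, allow_guess=True))
```

A human colleague would resolve “sort the prices” from context, such as knowing you are hunting for the cheapest option. The resolver has no such context, so its guess is arbitrary.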
I’ve previously written about this key limitation of large language models (now also present in these digital agents, and even in physical robots like PSC or the upcoming Tesla bot) and I don’t see how we could overcome it.
The only solution I see is to design and develop these AI systems the way we teach and educate children. They’d need to grow up in the world and interact with it to internalize all the knowledge they’re missing. Another option is to limit their scope to tasks that don’t require context, but that’s probably too restrictive for them to be useful. We’ll have to wait to see if I’m wrong here and there’s another way.
Conclusions
Adept has begun the ambitious quest to build AI agents that can act in the world. Like OpenAI, DeepMind, or Google, it has daunting challenges ahead to develop AI that’s not only very capable, but also reliable.
Adept’s vision and goals reinforce the importance of prompt programming as a new software paradigm to communicate with AI systems and computers in general. They also reveal its advantages over traditional HCIs, like programming languages, as well as its seemingly insurmountable limitations.
All in all, Adept is definitely a company worth keeping an eye on. The first Action Transformer, ACT-1, opens a promising line of research that will give us a lot to talk about in the coming months and years.
Alberto - another great piece that soundly balances hope and hype. As someone in the enterprise tech space, I am intrigued by this potential product. It is a very clever way to layer NLP on top of billions of dollars of existing software investments. It will be interesting for Adept to release more details on the state of ACT-1. The demos are slick but there is no further evidence of capability. I’m also curious how the model was trained. I agree that there are challenges to creating an NLP that can handle every user intent for an Excel spreadsheet. But I could see where it could still be highly effective. I’m guessing that a lot of spreadsheet use follows a Pareto distribution. A small number of capabilities like highlighting, or creating a profit column, likely make up a disproportionate volume of user activities. Just solving for those use cases could provide significant lift.
Great article, Alberto, even though I read it only now, it's still relevant. Are there any updates? Have you come up with more possible solutions to build such AI agents?