Useful General Intelligence — Danielle Perszyk, Amazon AGI

Channel: aiDotEngineer

Published at: 2025-08-02

YouTube video id: Dj0b_cEBHBI

Source: https://www.youtube.com/watch?v=Dj0b_cEBHBI

Hi everyone.
That's a little loud. I'm Danielle and
I'm a cognitive scientist working at the
experimental new Amazon AGI SF lab. And
throughout this conference, you're going
to hear a lot of talks about building
and scaling agents, including some from
my colleagues at AWS. But this talk is
going to be a little different. I want
to think about how we can co-evolve with
general purpose agents and what it will
take to make them reliable and aligned
with our own intelligence. So, I'd like
to set the stage by reminding us of a
fact about the reliability of our own
minds. We're all hallucinating right
now. Our brains don't have direct access
to reality; they're stuck inside our
heads. So they can only really do a few
things. They can make
predictions with their world models.
They can take in sensory information.
And they can reconcile errors between
the two. That's about it. And that's why
neuroscientists call our brains
prediction machines and say that
perception is controlled hallucination.
But there's no way, of course, that
I could be standing up here in front of
you if I didn't have my hallucinations
under control. The controlled part is
the critical bit. But that's not all
that's happening right now. If you're
understanding my words, then I'm also
influencing your hallucinations. And
assuming you do understand my words,
then your brain just did something else.
It activated all meanings of the word
hallucination, including this one. So
today we rely upon hallucinating chat
bots for brainstorming, generating
content and code and images of
themselves like this. But what they
can't yet do is think, learn, or act in
a reliable general purpose way. And
we're not satisfied with that because
we've set our sights on building AI that
more closely resembles our own
intelligence.
But what makes our intelligence general?
Well, one thing we know is that
hallucinations are necessary because
they allow us to go beyond the data.
They're features rather than bugs of AI
that's flexible like ours. So, we just
need to figure out how to control them.
I'm going to be drawing a lot of
parallels to our intelligence, but I'm
not saying that we are or should be
building something like a human brain.
We don't want AI to replace us or
replicate us. We want it to complement
us. We want AI plus humans to be greater
than the sum of our parts. Now, this
isn't typically what we think about when
we hear AGI. We think about the AI
becoming more advanced. But this
reflects a category error about how our
intelligence actually works. And that
error is that general intelligence can
exist within a thinking machine. So when
you think about AGI, you probably think
about something like this. And you might
think that it's right around the corner,
but why does it then feel like agents
are closer to something like this?
The reality is that models can't yet
reliably click, type, or scroll. And so
everyone wants to know how do we make
agents reliable? That's the question I'm
going to focus on today. So first I'll
share our lab's vision for agents.
Then I'll show you how Nova Act, a
research preview of our agent, works
today. And finally, I'll show you how
Nova Act will evolve and how you are
all central to that evolution. So let's
start with the big picture. Our vision
for agents is different than the
standard vision which reflects this long
lineage of thought that has become
folklore. So you all know the story by
now which is why you probably spotted
the hallucination here. The concept of
machines that can think like humans
didn't originate in the 2010s, but in
1956 when a group of engineers and
mathematicians set out to build thinking
machines so they could solve
intelligence. Of course, you also all
know that these guys didn't solve
intelligence, but they did succeed in
founding the field of AI and sparking a
feedback loop that changed how we live
and work. So first we built more
powerful computers. Then we connected
them together to build the internet
which enabled more sophisticated
learning algorithms. And this made our
computers even more powerful. And now
we're back to aiming for thinking
machines by another name, artificial
general intelligence or AGI. So the
standard vision is to make AI smarter
and give it more agency. And notice that
this is about the technology, not us.
Well, luckily this wasn't the only
historical perspective. Does anybody
know who this is?
This is Douglas Engelbart, and he
invented the computer mouse and the
GUI. He didn't care so much about
thinking machines and solving
intelligence. What he cared about was
thinking humans and augmenting our
intelligence. And he proposed that
computers could make us smarter. Of
course, he was absolutely right. So, as
computers became pervasive, they also
started changing our brains. We began
offloading our computation to devices,
distributing our cognition across the
digital environment. And this had the
effect of augmenting our intelligence.
Scientists call this technosocial
co-evolution. It just means that we
invent new technologies that then shape
us. So here we have two historical
perspectives for the goal of building
more advanced intelligence that
resembles our own. We can build AI that
is as smart as or even smarter than us.
Or we can build AI that makes us
smarter. We all believe that more
general purpose agents are going to be
more useful. But how? Well, things are
useful when they have one of two
effects. They can simplify our lives by
allowing us to offload things or they
can give us more leverage. And yes,
automation is an engine for
augmentation. This is how we become
expert at things. We start by paying
conscious attention to the details. We
practice and then our brain moves things
over to our subconscious. Automation
frees up our attention to focus on other
things.
The problem is that automation doesn't
always lead to augmentation. Sometimes
it even comes at a cost. How many hours
have we lost to scrolling? Or how many
echo chambers have we been trapped
within? How many times has autocomplete
just shut down our thinking? So this is
how algorithms can reduce our agency.
And it's how increasingly intelligent
agents might cause more problems than
they solve. But if we have precise
control and we actively tailor these
systems the way that we want, then we
can actually increase our agency. And
this is the crossroads in front of us.
We can continue to make AI smarter and
give it more agency. We can focus on
unhobbling the AI, as it's fashionable
to say, but this doesn't guarantee that it
will be useful to us. It just guarantees
that we'll continue to see a lot of the
same patterns that we've seen in tech
recently. And that's why our vision
is to build AI that makes us smarter and
gives us more agency, to build AI that
unhobbles humans. So, how do we do that?
Well, in these early stages, we need to
do two things. We need to meet the
models where they are and meet the
builders where they are. So all of you
have a million ideas about what you want
to do with agents. We have to make it
frictionless for you to get started.
And Nova Act does these two things.
We're building a future where the atomic
unit of all digital interactions will be
an agent call. The big obstacle is that
we still only have some infrastructure
for APIs.
Most websites are built for visual UIs,
and since most websites lack APIs, we
need to use the browser itself as a tool.
That's why we've trained a version of
Amazon's foundation model, Nova, to be
really good at UIs, to interact with UIs
like we do. Nova Act combines this model
with an SDK to allow developers to build
and deploy agents. All you have to do is
make an act call, which translates
natural language into actions on the
screen. And I'm going to show you a demo
here where my teammate Carolyn will show
you how you can use Nova Act to find our
dream apartment.
We're searching for a two-bedroom,
one-bath in Redwood City. Here we've given
our first act call to the agent. It's
going to break down how to complete this
task, considering the outcome of each
step as it plans the next one. Behind
the scenes, this is all powered by a
specialized version of Amazon Nova
trained for high reliability on UI
tasks.
And next, I'm going to show you my
teammate Fjord, who will describe how
you can do even more things with
Python integrations. All right, we see a
bunch of rentals on the screen. So,
let's grab them using a structured
extract. We'll define a Pydantic class
and ask the agent to return JSON
matching that schema.
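
A hedged sketch of what that extraction might look like, following the schema pattern in the Nova Act SDK research preview; the Apartment fields and the rentals site are illustrative assumptions, not the exact demo code:

```python
from pydantic import BaseModel
from nova_act import NovaAct

class Apartment(BaseModel):
    address: str
    price: str
    beds: str
    baths: str

class ApartmentList(BaseModel):
    apartments: list[Apartment]

with NovaAct(starting_page="https://www.example-rentals.com") as nova:
    nova.act("search for two-bedroom, one-bath apartments in Redwood City")
    # Ask the agent to return JSON matching our schema, then validate and parse.
    result = nova.act(
        "Return the rentals currently visible on the page.",
        schema=ApartmentList.model_json_schema(),
    )
    apartments = []
    if result.matches_schema:
        apartments = ApartmentList.model_validate(result.parsed_response).apartments
```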
For my commute, I want to know the
biking distance to the nearest Caltrain
station for each of these results. Let's
define a helper function: add_biking_distance
will take in an apartment and then use
Google Maps to calculate the distance.
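
A sketch of such a helper, assuming the hypothetical Apartment model above; it opens Google Maps in its own browser session and asks the agent to read off the biking time (the inline schema and field names are illustrative):

```python
from nova_act import NovaAct

def add_biking_distance(apartment: Apartment) -> dict:
    """Look up the biking time from an apartment to the nearest Caltrain station."""
    # Each call opens its own Google Maps session in a fresh browser.
    with NovaAct(starting_page="https://www.google.com/maps") as nova:
        result = nova.act(
            f"Get biking directions from {apartment.address} to the nearest "
            "Caltrain station and return the trip time in minutes.",
            schema={"type": "object", "properties": {"minutes": {"type": "number"}}},
        )
        minutes = result.parsed_response.get("minutes") if result.matches_schema else None
    # Merge the commute estimate into the apartment's fields for the table later.
    return {**apartment.model_dump(), "biking_minutes": minutes}
```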
Now, I don't want to wait for each of
these searches to complete one by one.
So, let's do this in parallel. Since
this is Python, we can just use a thread
pool to spin up multiple browsers, one
for each address.
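
A minimal sketch of that fan-out, reusing the hypothetical helper above:

```python
from concurrent.futures import ThreadPoolExecutor

# Each worker opens its own browser session, so the Maps lookups run in parallel.
with ThreadPoolExecutor(max_workers=4) as executor:
    rows = list(executor.map(add_biking_distance, apartments))
```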
Finally, I'll use pandas to turn all
these results into a table and sort by
biking time to the Caltrain station.
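
That final step might look like this, assuming the rows produced above:

```python
import pandas as pd

# Turn the results into a table and sort by the commute estimate.
df = pd.DataFrame(rows).sort_values("biking_minutes")
print(df.to_string(index=False))
```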
We've checked this script into the
samples folder of our GitHub repo. So,
feel free to give it a try.
So, we've made it really easy to get
started. It's just three lines of code.
And under the hood, we're constantly
making improvements to our model and
shipping those every few weeks. And this
is important. Because even the building
blocks of computer use are deceptively
challenging. Here's why. This is the
Amazon website. And let me ask you, what
do these icons mean? We typically take
for granted that even if we've never
seen them before, we can easily
interpret them. And when we
can't, there are usually plenty of cues
for us to know what they mean. Now,
Amazon actually labels these, but in
many contexts, the icons are not
labeled, and we couldn't possibly teach
our agent all of the different icons,
let alone all of the different useful
ways that it could use a computer. So,
we have to let our agent explore and
learn with RL. And it's really
fascinating to think about how RL will
enable these agents to discover how to
use computers in entirely new ways. And
that's okay because we want them to be
complementary to us. But if we're going
to diverge in our computer use methods,
then it's really critical that our
agents' perception of the digital world
is aligned with our own. And that's not
what most agents can do right now.
So current agents are LLM wrappers that
function as read-only assistants. They
can use tools and some of them are
getting really good at code, but they
don't have an environment to ground
their interactions. They lack a world
model. Computer use agents are
different. They can see pixels and
interact with UIs just like us. So you
can think of them as kind of having this
early form of embodiment. Now we're not
the only ones working on computer use
agents, but our approach is different.
We are focusing on making the smallest
units of interaction reliable and giving
you granular control over them. Just
like you can string together words to
generate infinite combinations of
meaning, you can string together atomic
actions to generate increasingly complex
workflows.
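
As an illustrative sketch of that composition, several small act() calls can be strung into one longer workflow; the site and instructions here are hypothetical:

```python
from nova_act import NovaAct

# Each act() call is one small, reliable atomic step; chaining them
# composes an increasingly complex workflow.
with NovaAct(starting_page="https://www.example-store.com") as nova:
    nova.act("search for a stainless steel water bottle")
    nova.act("filter the results to four stars and up")
    nova.act("sort the results by price, low to high")
    nova.act("open the first result and add it to the cart")
```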
Now, grounding our interactions in a
shared environment is necessary for
building aligned general-purpose agents,
but it's not sufficient. Computer use
agents will need something else to be
able to really reliably understand our
higher-level goals. So how will Nova Act
need to evolve to make us smarter and
give us more agency? In other words,
what is it that makes our intelligence
reliable, flexible, and general
purpose? Well, it turns out that over
the past decades, as engineers were
building more advanced intelligence,
scientists were learning about how it
works. And what they learned was that
this isn't the whole story. It's just
the most recent story of our
co-evolution with technology. So humans
co-evolving with computers is this thing
that we're fixated on. But
the story goes back a lot longer. And
Engelbart actually hinted at this. He
said, "In a very real sense, as represented
by the steady evolution of our
augmentation means, the development of
artificial intelligence has been going
on for centuries." Now he was correct, but
it was actually going on for a lot
longer than that. So let me take you
back to the beginning. Around six
million years ago, the environment
changed for our ancestors and they had
exactly two options. They could solve
intelligence or go extinct. And the ones
that solved intelligence did so through
a feedback loop that changed our social
cognition. This should look familiar.
First, our brains got bigger. Then we
connected them together, which enabled
us to further fine-tune into social
information. And this made our brains
even bigger. But now you know that this
scaling part is only half of the story.
The other half had to do with how we all
got smarter. So we offloaded our
computation to each other's minds and
distributed our cognition across the
social environment. And this had the
effect of augmenting our intelligence.
So scientists call the thing that we got
better at through these flywheels
representational alignment. We figured
out how to reproduce the contents of our
minds to better cooperate. The key
insight here is that the history of
upgrading our intelligence didn't start
with computers. It started with an
evolutionary adaptation that allowed us
to use each other's minds as tools. Let
me say that in another way. The thing
that makes our intelligence general and
flexible is inferring the existence of
other minds. This means that this is
general intelligence. This can be
general intelligence. This could
possibly be general intelligence, but
there's no reason to expect that it will
be aligned. And this is not
general intelligence. Intelligence of
the variety that humans have can't exist
in a vacuum. It doesn't exist in
individual humans. It won't exist in
individual models. Instead, general
intelligence emerges through our
interactions. It's social, distributed,
ever evolving. And that means that we
need to measure the interactions and
optimize for the interactions that we
have with agents. We can't just measure
model capabilities or things like time
spent on platform. We have to measure
human things like creativity,
productivity, strategic thinking, even
things like states of flow.
So let's take a closer look at this
evolutionary adaptation. Any ideas as to
what it was?
It was language. So, language co-evolved
with our models of minds in yet another
flywheel that integrated our systems for
communication and representation. And it
did this by being both a cause and an
effect of modeling our minds. Let's
break that down. We've got our models
and our communicative interfaces. And
then here's how they became integrated.
As we fine-tuned into social cues, our
models of mind became more stable. This
advanced our language and our language
made our models of mind even more
stable. And then here's the big bang
moment for our intelligence.
Our models of mind became the original
placeholder concept, the first variable
for being able to represent any concept.
That right there is generalization. So
you might be thinking, but is this
different from other languages? And the
answer is yes. Other communication
systems don't have models of mind.
Programming languages don't negotiate
meaning in real time. This is why code
is so easily verifiable. And LLMs don't
understand language. What do we mean
they don't understand language? They
don't understand that words refer to
things that minds make up. So when we
ask what's in a word, the answer is
quite literally a mind.
So language was so immensely useful that
it triggered a whole new series of
flywheels that scientists call cognitive
technologies. Each one is a foundation
for the next and each one allows us to
have increasingly abstract thoughts.
They become useful by evolving within
communities. So early computers
were not very useful to many people.
They didn't have great interfaces. But
Engelbart changed this. Now computers
are getting in our way. We've never had
the world's information so easily
accessible, but also we've never had
more distractions. And agents can help
fix this. They can do the repetitive
stuff for us. They can learn from us and
redistribute our skills across
communities. And they can teach us new
things when they discover new knowledge.
In essence, agents can become our
collective subconscious. But we need to
build them in a way that reflects this
larger pattern. So collectively these
tools for thought stabilize our
thinking,
reorganize our brains and control our
hallucinations. How do they control our
hallucinations? Well, they direct our
attention to the same things in the
environment. They pick out the relevant
signals from the noise and then we
stabilize these signals to co-create
these shared world models. And what does
that sound like? It sounds like what
we're building. So another way of
thinking about Nova Act is as the
primitives for a cognitive technology
that aligns agents' and humans'
representations. And just like with
other cognitive technologies, early
agents will need to evolve in diverse
communities. So that's where all of you
come in. But reliability isn't just
about clicking in the same place every
time. It's about understanding the
larger goal. So to return to our big
question, how do we make agents
reliable? Eventually they're going to
need models of our minds. So the next
thing that we'll need to build is agents
with models of our minds. But we don't
actually build those directly. We need
to set the preconditions for them to
emerge. And this requires a common
language for humans and computers. And
at this point, you know what this
entails: agents will need a model of
our shared environment and interfaces
that support intuitive interactions with
us. These will enable humans and agents
to reciprocally level up one another's
intelligence. To advance the models, we
will need human agent interaction data.
And to motivate people to use the agents
in the first place, we'll need useful
products. The more useful the products
become, the smarter we will all become.
So this is how we can collectively build
useful general intelligence. If you
want to learn more about Nova Act then
stick around right here for the upcoming
workshop. And thank you for your time.