Jack Morris: Stuffing Context is not Memory, Updating Weights is

Channel: aiDotEngineer

Published at: 2025-12-29

YouTube video id: Jty4s9-Jb78

Source: https://www.youtube.com/watch?v=Jty4s9-Jb78

[music]
>> Let's talk about ChatGPT. I think like
ChatGPT knows a lot of things. It's
actually extremely impressive. I use it
all the time. I used it to help prepare
for this presentation. You know, I used it
to cook last night. Um,
you know, I'm growing increasingly
dependent on it. And yet there's a lot that
ChatGPT doesn't know. Like um
it didn't know why my speaker pass
wasn't working when I was trying to get
into the building and it uh
if you ask it did the Blue Jays win the
World Series, the answer is no. And I
know that because I watched the World
Series, but ChatGPT doesn't know that if
you don't enable web search because it
has something called a knowledge cutoff. So all the training data is kind of
segmented by date and things after a
certain date are not known by ChatGPT
like unilaterally.
Uh if you ask ChatGPT help me optimize
this kernel I wrote for AMD GPUs,
it's so bad at it. And I think there's a
few reasons for this. One, it's really
hard. Two, uh there's not a lot of data
for it. But three, I think it's more
that the data that does exist is such a
small portion of its training data that
it just like can't do it very well. And
so a lot of tasks like this, which I
I would guess a lot of you face in your
jobs, like the things that are more
niche or here I call long tail, are
really hard for ChatGPT to do. Even if
you say, "Please like
please sir." [laughter]
Like I want you to learn more about this
or practice. Like it can't learn more
about this. It can't practice. It
doesn't know what to do when you ask it
that. And yeah, if you ask what are the
terms of our partnership agreement for
BlackRock, it doesn't know about your
company. Which of these shirts should I
order from Amazon? Implement a new
feature in our
company mono repo. Write an email in my
style. Diagnose this patient given their
history. What arguments did the opposing
counsel use in the Martinez settlement
negotiations?
Is this question already answered on our
company internal wiki? Like none of
these things are
possibly answered by ChatGPT because
they're not in the training data or
they're too niche or they require some
data that's not available to it.
So I think like the question I want to
talk about today is like what's the
right [clears throat] way to solve this
problem? Like if we want to build new
systems that actually know the things we
want them to know, how how should we
build them? And I think like the way I
want to think about it is like how do we
take some knowledge and inject it into
the parameters of the model? Like what's
the right way to do this?
And like the way that I think about it
and I think the way this manifests in my
research and other people's research is
there's three ways. There's
full context. You can take as much stuff
as you can and cram it into the language
model. There's rag or retrieval
augmented generation where you have so
many things that you can't fit them all
in and so you retrieve the most useful
ones and then feed them in.
And then there's this third thing, which
I think is like really new and no one is
doing it yet, which is training things
into weights. And I want what I mostly
want to talk about today is like why I
think we should be training things into
weights. But I'm going to start with the
other two.
And also I guess like along the way I
about 10% of the time I'm going to be
shilling my own research, but I'm going
to like try to be honest about it. And
you can just tune me out if you want.
So I think like the easiest way to solve
these problems is to put everything into
context. Like if you work at a small
company or
all you care about is like maybe the 100
World Series that have occurred, you can
kind of copy all the data and paste it
into ChatGPT or paste it into Grok or
whatever model you use and that's
finite enough that the model can
understand.
And this like works pretty well. I
think that this is something that got
people really excited for a while a few
years ago. I have this example of like a
doctor answering a question from a
medical record. A medical record is
small enough that it can presumably be
like input into the context of the model
and the model can do pretty well. I
think there's a few problems with this.
Maybe the main one is just that it's so
expensive. Like if you do anything like
this in your day-to-day workflow, you
put like a ton of tokens in the context
to start generating. I mean one, it's
going to cost a lot of money like US
dollars. But two, it's just so slow like
you know, a few months ago I was writing
my thesis and I wrote it myself, but I
did ask for some feedback a few times
from Claude. And it's, I don't know, maybe
80 pages of text or something; as documents go, it's
medium length. But the second you paste it
into Claude, everything slows down by
10x or something. I have this stat here
that if you have 1,000 tokens of
context,
we can output 10,000 tokens per second.
If you have 128k tokens of context, we can output
130 tokens per second. So that's like
several orders of magnitude slow down
and I think we've all faced this. So
it's very annoying and it's hard to
imagine how we can get around this.
Um I'll give you like the quick
background from the research world,
which maybe people know, which is this
inherent limitation of the models we use.
The models we use are transformers.
Transformers look like this. The real
problem with transformers comes
in this one little
box right here called self-attention.
The problem is that all of the words
that go into the transformer need to
look at each other. And this has a
quadratic tendency. So if there's four
words, four tokens, maybe the matrix has
16 entries. If there are 12 tokens,
there are 144 entries. And we can manage
this for a while, but at some point it
becomes infeasible. Especially from
a memory perspective, we can't keep
all these things in context.
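To make the quadratic scaling concrete, here's a minimal sketch (not from the talk) of how the attention score matrix grows with the number of tokens:

```python
def attention_matrix_entries(num_tokens: int) -> int:
    # Every token attends to every other token, so the score matrix
    # is num_tokens x num_tokens per attention head.
    return num_tokens * num_tokens

for n in [4, 12, 1_000, 128_000]:
    print(f"{n:>7} tokens -> {attention_matrix_entries(n):>17,} attention entries")

# 128k tokens already means ~16 billion score entries per head per layer,
# which is why memory, not just compute, becomes the bottleneck.
```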
You might say, "Well Jack, Grok 4 has 2
million token context window."
Yeah, 2 million token context window.
It's a very large number. Gemini 3
dropped
during this conference and Gemini 3 has
1 million token context window.
You also might ask why did Gemini 3 not
do a larger context window even though
it came after Grok. And I think the
reason is because there's
[clears throat] a difference between
the model not breaking when you put in
that many tokens and the model actually
like properly reasoning across many
large chunks of tokens.
And I think the second part we're still
figuring out. I think people have
realized how to train models that don't
break with more and more tokens, but we
haven't really gotten to the point where
we can train models that truly work as
well on a million tokens as they do on a
thousand tokens.
And if you're more curious about this,
there's this really good report from
Chroma called Context Rot
about how performance degrades when you
add just like other stuff into the
context. So this graph shows like the
larger the context grows, even with the
same finite amount of relevant
information, the LLMs get worse and
worse. And I think like two things to
observe here that I think are
interesting. One, Claude is the best by
far. I like graphs like this because I
feel like when you talk to people, a lot
of people think Claude is the best. But
if you
measure on a lot of standard benchmarks,
it actually is worse. But then you use
it and you're like, "Oh, there's
something's better here." So I like this
because it captures what people actually
say to me. But I also like it because
once you get here, the performance is
horrible. So like if you enter
a bunch of stuff that doesn't
actually help you solve the problem,
once you get to 10 to the fourth tokens,
which is 10,000, like the models don't
work at all. And even though they're not
breaking, like they're outputting
things that make sense and are
grammatical,
they're not actually solving the
problem. So context rot is a huge
issue.
Um
maybe like just anecdotally, if you look
up there's a ton of people saying stuff
like this. Like, "Oh, what the context
window so long, why does it not actually
work?" Or people think Claude code when
it fills up the context window sort of
like stops working.
Um there's a ton of people working on
these efficient architectures that you
might hear about, like
Mamba, state space models, linear
attention,
hybrid attention, sparse attention,
sliding window. They're all more
efficient, but they basically have the
same properties of transformers. Like
even if they can
operate uh in a faster time or with a
lower memory requirement, there's some
trade-off in terms of the performance
they give you. So even if you build a
linear attention model that can fit
infinite context, it's not good. Like
it's not going to be able to solve the
problem you have, which is how do I
actually like reason and get smarter
when I input more tokens into the model?
There's so many examples of this. I saw
this recent post. If you're like kind of
deep in the model architecture world,
maybe you've seen this. This is like a
couple weeks ago. There's a new Chinese
model, MiniMax M2, that's one of the state
of the art open models. And a bunch of
the other Chinese labs have been pushing
these new hybrid architectures that are
like more efficient and can take longer
context. And MiniMax M2 just didn't do
that. They just used sort of like the
regular quadratic attention that I was
showing you. And they have this really
long story about how they tried and
tried and
it's basically just not worth it.
There's like an inherent trade-off in
how much computation you use and and how
good the models are. And so even if you
can technically build a model that
doesn't break at millions of tokens,
it's not actually better for any of the
tasks they care about. So no one is
really doing this.
And I think to conclude, we think that
like we're pretty limited by the context
window in full context. There's like one
systems problem that you can't put
millions of tokens into the model. And
then there's another reasoning problem
that even if you can, the models don't
actually get better. So it's probably
not practical. And I think if if you
work in industry, I'm sure you see
document sets that are much much larger
like on the order of I don't know
billions to trillions of tokens and even
though we're getting better at training
the models and the system side we're
getting much better at running them more
efficiently, faster, cheaper, we're not
near fitting trillions of tokens into a
model. I think like that's pretty far
off.
So I would guess a lot of you are doing
rag. How many people in this room use or
work on a rag system on like a weekly
basis?
That's actually pretty crazy. Okay, so
over half for sure.
So now we're going to talk about rag.
I'm going to talk about why it's good
and then I'll talk about why I think um
it's fundamentally limited and
the products of the future will use
something better than rag.
So if you use rag you probably use a
vector database. There are many vector
databases. I think
I know some of these: Turbopuffer,
Weaviate... another one is on S3, that's Chroma. I made
this slide.
Uh
Uh there are many different vector
databases. They all offer you like
slightly different trade-offs. They give
you your vectors for cheaper, faster, um
Vector databases are the way that memory
works in production. If you're using a
company internal question answering
system, it's it's definitely running on
rag which is powered by a vector
database which stores embeddings.
ChatGPT memory uh uses embeddings.
Uh
Andrej Karpathy has this diagram from
last year, 2 years ago actually, of what
an operating system that runs on
language models would look like and he
called embeddings the file system of
LLMs. Um I think that's true in today's
terms like today, November 22nd, 2025,
probably like if you think of what
you're working on as an operating system
the file system is embeddings. But I
think embeddings are the file system of
today and they're not the file system of
the future. And that's what I'm going to
talk about today.
I also want to point out that they're
extremely easy to use like any of the
tools I'm going to talk about at the end
of the talk that are like related to
training things into models are just
fundamentally harder. But this is just
really nice and we can all take a moment
to appreciate it. You just sort of
take your text and then you like run
this
and that's all. So, five lines of
code. That's really, really good.
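Roughly the kind of thing that slide shows; a minimal sketch assuming the sentence-transformers library (the actual slide may use a different tool):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any off-the-shelf embedding model
docs = ["Visa raised interchange fees in 2024.",
        "MasterCard announced a new bank partnership."]
doc_vecs = model.encode(docs)                      # one vector per document
query_vec = model.encode("What did Visa change?")
scores = util.cos_sim(query_vec, doc_vecs)         # similarity search
print(docs[int(scores.argmax())])
```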
The problem is they just aren't that
good, and they have a lot of problems, I
think. Um, okay, how
many people work on rag or experience
a rag system and are completely satisfied with it?
>> [laughter]
>> Okay, that's great. So I think we're all
kind of in agreement here that maybe
there there could be something more.
Like even if we don't know exactly what
it is there must be something else out
there.
Um
I'll talk about a few problems that I've
run into in my own research. So let's
like start with this abstract. And so
this is the vector database that powers
rag.
Every dot here is supposed to be a
document. So the document goes through
the LLM. The LLM is trained to give you
just this one vector that represents the
document. I've projected them down to
two dimensions for this slide, but each
document is one dot. Um, if you
actually look at what's in the vector
database, it looks like this. So there's
lots of
numbers. There's no one in the world who
can tell you what this means. Um
one thing that I think is interesting is
that even though they look random and no
one can actually read them, if you build
a system to read them, it works pretty
well. So like if you're working in a rag
and you're sending someone embeddings,
you're actually sending them
something analogous to text. And I think
this is important because a lot of the
actual architectures like Turbo puffer,
Pinecone, what have you, they store only
embeddings. And so like maybe there's
this false premise that if you just send
them embeddings there's no security
flaws, but actually even a slightly
motivated person can build this system
here, this white arrow on the right,
which takes the embedding and produces
maybe not the exact same text, but
something extremely close to it. This is
what I worked on for like about a year
of my PhD. This is an
animation: I type in the
sentence, it goes into the embedding
model, it gets stored in a vector
database, and then we run this
multi-round correction thing. And
by the end we can actually get most of the text back; I
think in our research, at a certain
length, we can get 90% of text back
exactly from vector databases. So the
takeaway here is that there's no uh
security benefits to using a vector
database. And also they're very hard to
run at scale. So this is like an
inherent problem for people with
sensitive data. That's the paper.
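The multi-round correction loop he's describing looks roughly like this; a sketch with hypothetical helper names, not the actual code from the paper:

```python
def invert_embedding(target_vec, embed, corrector, num_rounds=10):
    # Start from an initial hypothesis, then repeatedly re-embed the guess and
    # let a trained corrector model nudge the text toward the target embedding.
    text = corrector.initial_guess(target_vec)
    for _ in range(num_rounds):
        current_vec = embed(text)
        # The corrector is trained to map (guess, guess embedding, target
        # embedding) to a revised guess whose embedding is closer to the target.
        text = corrector.refine(text, current_vec, target_vec)
    return text
```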
Um
I think a second problem that I
personally have with embeddings is that
they're not adaptive. Like there's this
one universal sense of what the world
looks like that's captured in these
vectors and it's not adjustable based on
what you work on. So like to give you a
concrete example,
we created a database of a bunch of
embeddings of credit-card-related
documents. I think we had half of them
that were from MasterCard and half of
them that were from Visa. But if you
actually look at where the embeddings
get stored,
um I guess it's not in this picture, but
it's like only right here. So even
though there's this like really large
space of kind of all possible semantics,
embeddings only represent like one
universal one if that makes sense. So
credit cards are actually clustered in
this like really small area, and this
means search works badly. So
like to give you a concrete example, if
you take these two documents, one's from
Visa, one's from MasterCard, in
the system we were designing, if you
search something that's about a Visa
query, you should never receive
MasterCard. But they're all so close to
each other that they're actually like
completely all jumbled together. And
this is just like a problem with all
conventional embedding mechanisms. So we
built this new model that lets you feed
in some like surrounding documents. So
like to give you an example, this is
kind of the first half of our model. We
would feed in a bunch of credit cards. I
just put Amex, but there actually was no
Amex when we did it. And um
and the model kind of works like this.
Like when it produces the embedding for
the text, which is here, it also looks
at a bunch of surrounding documents so
it can kind of know like, okay, this
text is about Visa, but also all the
other documents are about either Visa or
MasterCard. And it gets trained so that
it can like dynamically adjust the
embeddings based on like the surrounding
context. So I thought this was cool.
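A sketch of the idea (hypothetical interface, not the paper's actual code): the embedding of a document is conditioned on a sample of surrounding documents from the same corpus, so the model can spend its capacity on what distinguishes this document within that corpus:

```python
def contextual_embed(text, corpus, encoder, k=8):
    neighbors = sample_documents(corpus, k)   # surrounding documents (hypothetical helper)
    # First stage: encode the corpus context; second stage: embed the text
    # conditioned on that context, so "Visa vs. MasterCard" gets pulled apart
    # even though globally all credit-card text looks similar.
    corpus_state = encoder.encode_context(neighbors)
    return encoder.encode(text, context=corpus_state)
```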
And it works better. So like in this
Visa MasterCard case, the similarity
between a Visa and MasterCard is now
0.144. And I think anything containing
Visa has a much higher similarity.
So that's like maybe correcting one
small thing. Um it works better on like
out of domain stuff. So we have a
climate data set (I forget which one), a
data set of arguments, a data set of
financial questions, and then I think
like scientific articles, and I guess
the point I'm making here is that if you
do this contextual thing, embeddings
work a bit better. So like if you build
them in a way that they can dynamically
adapt to the domain, they can solve some
problems, but I think at the end of the
day they're still embeddings. And so
you're
Yeah, yeah.
Was this approach picked up by anyone
else? Do you know if they know it?
Yeah, I think we know they're using it
at OpenAI in practice like behind the
scenes now that embedding models are
contextual. It's a pretty it's kind of a
free lunch. Like you add these extra
tokens. Uh
I guess it's it's kind of hard to build.
Like you have to build this two-stage
model and then uh when you embed
something you have to grab some
embeddings from the surrounding
documents, but once you build it it just
works, you know, better on like
especially on long tail stuff. I think
if you look at um
like MS Marco, which is this large web
scale
embedding task, it really doesn't get
much better when you add surrounding
stuff because like it's already pretty
global if that makes sense. But if you
look at like really niche things,
embeddings work a lot better. So yeah, I
I know it's productionized at some other
companies. Um I think if you're actually
building an embedding model at your
company and you want to put effort into
making it better, this is probably like
the easiest way besides data. Probably
the first way is data. Um
There's some recent work that I think is
worth mentioning about like fundamental
limitations of embeddings and vector
databases and rag. It's not even really
worth explaining in full, but the idea is
that there are some relationships that
cannot be captured in a fixed-dimensional
vector. Like you have to reason about
things to answer all possible tasks. And
this is this kind of combinatorial setup
where there are so many possible
relationships that the embeddings simply
can't store them. And so like in theory
embeddings are obviously
not the best way to do all possible
relationships between text.
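One rough way to see the counting intuition (my paraphrase of the general idea, not the paper's exact statement):

$$\underbrace{\binom{n}{k}}_{\text{possible sets of } k \text{ relevant docs among } n} \;\gg\; \underbrace{\mathrm{poly}_d(n)}_{\text{sets a fixed } d\text{-dimensional embedding can realize}}$$

once $n$ is large, so for a fixed embedding dimension there are always relevance patterns that no assignment of vectors can reproduce.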
But I think everyone knows that rag has
its issues. Like I'm glad that no one
raised their hand when I asked if anyone
was going to like really stand up and
speak for rag. And like, I
actually think this is a hard point to
make. Like everyone kind of knows this,
but it's hard to come up with examples
that retrieval can't solve in practice.
Like speaking as someone who's recently
sat down and tried to make benchmarks
for tasks that I care about, it's hard
to express questions that require kind
of this like latent reasoning over
multiple documents in a way that rag
doesn't solve. But they do appear. Like
um anything that kind of requires
association between multiple things or
questions that are they're like sort of
implied but not explicitly answered by
the documents are just not solvable by
current techniques. And also if you have
interesting examples of this would love
to hear after after the presentation. Um
Hopefully I've made my case that I think
rag Oh, yeah, yeah, go ahead.
I'm curious if you would classify
agentic search as rag as well. Yes,
that's a good question. So, I guess the
way I think of agentic search is like a
model that can go and make a bunch
of queries in a row and then respond. Um
Yeah, that's that's a really good
question. I think
I think I wouldn't classify it as rag,
but I think it has
different fundamental limitations that
are also tough to overcome. Like what
you what you would really want is like a
model that reads the entire thing and
reasons about every possible
relationship and then answers. And I
think in theory maybe you could build an
agentic rag system that does that, but
it would be very expensive.
I think, cuz [clears throat] isn't deep research in the
direction of that, where like it goes
through hundreds or thousands
of sources, but then what ends up in
context is only like a small subset of
those. Yeah, yeah, I actually think deep
research is like really in the right
direction. Like they're trying to
do something that's a little bit higher
level and requires a lot of compute.
Like I think um anything that works
better than rag is going to be more
expensive. And so like just the property
that it takes a while and it makes a lot
of searches and it thinks a lot is like
good. I think that there's probably a
more elegant way to train like a really
big kind of deep-research-esque system.
But I think that's that's actually a
a good way of doing this and and not the
one that I'm talking about today, but
it's very promising as well. Like maybe
the question is like are you willing to
spend a lot of money at training time or
at inference time? And deep research is
like kind of they don't spend a lot of
money to train it, but it's willing to
wait for a long time at inference. And I
think the things I'm going to talk about
today are more like if you're willing to
spend a lot of money up front and you
get a really smart model that knows all
your data already. Um
and it's really cheap to do inference.
So, it's like kind of different sides of
the same trade-off. And I think like a
good way of thinking about these things
is like to get better models you're
going to need to pay somewhere, you
know? Like you're either going to need
to like generate better data and spend
more time on the data, you're going to
need to spend time on training, or
you're going to need to spend time on
inference. And a nice thing about rag is
it kind of just works, but anything
better will cost more. Yeah. Uh getting
back to your example of MasterCard
versus Visa. Yeah, sure. I don't
know if that's in your presentation
later, but what are your thoughts on
using knowledge graphs for that?
As kind of augmenting
It's a good question. Maybe ask me
after. I have to think about knowledge
graphs. It's been a while.
Um
so, let's talk about how to learn things
in weights.
Um I think like the question that we
want to get at is like okay, so say we
have the example I showed earlier or
like you have a small data set you
collected from your own personal work
and you want to teach it to the model.
It's
one thing to put it into context and
that's a good way to get started. And if
you don't have that much data, that'll
get you pretty far. But I think we can
do more. Like there's some questions
that even when your data is in context,
the model can't answer. And so what I
want us to think about is like how can
we inject things into a model
uh such that it learns better than in-context
and also doesn't forget
everything that it already knows.
Um I want to point out something from my
own research, which is that there is a
fixed capacity to language models. Like
one way to think about this is ChatGPT
has only so many parameters. We
have this measurement that it can store
about 3.6 bits per parameter. So a
billion-parameter model at 3.6 bits per
parameter can hold something on the order
of half a gigabyte of information. Um
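As a rough back-of-the-envelope (my arithmetic, not a number from the slide):

$$10^{9}\ \text{parameters} \times 3.6\ \tfrac{\text{bits}}{\text{parameter}} = 3.6\times 10^{9}\ \text{bits} \approx 4.5\times 10^{8}\ \text{bytes} \approx 0.45\ \text{GB}.$$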
this is like some information, but it's
actually not that much. So, the models
they basically do their best to fit the
training distribution and they throw
everything else out. So like to give you
a concrete example, this morning I was
putting this together. I asked Claude,
what is the capital of the smallest
province in Tajikistan? And
it gave me a very detailed answer. It's
actually very impressive. No web search,
the model just knows this in its
parameters. I guess I'm arguing that
this is bad. Like if you want to build a
system that can answer
really detailed documentation questions
for your company,
you don't need it to know what the
capital of the smallest province in
Tajikistan is. And since we know these
models have fixed capacity, I think that
this is bad. Like what we really want is
to know how to like find this kind of
thing and just like delete it and
replace it with the things we care
about. And I think that's like what
we're getting towards, but we don't 100%
know how to do that again.
Oh, sorry. So, when I originally put
this talk together, the way I was
thinking of explaining it is calling it
a neural file system. And then I decided
to just call it weights. I think it's
easier to understand, but this slide
still says neural file systems. Um
So, I think there's a few questions
here. Like we want to train all our data
into the model. One question is like how
do we train it? Do we do RL? Do we do
SFT? Uh what's what even is the data? Um
another question is like out of
uh all the possible data, what do we
use? Do we just like fine-tune directly
on our data? Do we try to generate more?
I think my argument is that we should
try to generate more and I'll show you
why. And then there's an architectural
question. Like I think for a long time
people really cared in the machine
learning deep learning community about
like what architectures we should use.
And then for like what, 8 years,
everyone who knows what they're doing has
really just been using transformers
unless they're trying to make them
better. And I think now in this world
where we're trying to train stuff into
models, like
like if you think of, okay, a world where
each of us has our own model or
maybe multiple models and those models
are getting updated a lot, I think we
start to care about architecture again.
And I'll and I'll tell you why and like
what I think the options are.
So, [clears throat] first let's talk
about learning.
Um
So, I think like the mental [snorts]
model here, which I mentioned before, is
like we're trying to train the model to
learn the data as best as it possibly
can and it's going to be expensive. So,
like we didn't like rag, but also rag
didn't cost us very much money. I think
to do better than rag, we're going to
have to like pay some GPU points. And
that's just like the state of the world.
Okay, fine. So, this is our model. It's
like this homogeneous blob of data and
this is our data. So like maybe we have
the MasterCard data set or maybe we
collected data about ourselves or maybe
I uh
collected all my traces from coding in
November and December and I want to like
train the the model to learn my problems
better. What do I do? How do I actually
do this? Um
Let's like start with the dumbest
possible approach and just like see what
happens. So, say
uh we start with a data set
and we just train on it.
Um like using I guess next token
prediction. So, we actually ran this
little experiment. This is like uh 3M.
It's a company that makes duct tape.
And
um
this is like some financial reports. So,
maybe like you're working there and you
really don't want to read all of this.
So, you just want to ask the model to
like really understand this and be able
to answer questions. And like rag isn't
really working cuz it's like this weird
structure and there's a lot of ways the
documents interrelate. Okay, cool. So,
we're just going to like train the model
using next token prediction. See what
happens. You know what? Actually, even
if you don't train the whole model, um
you still get zero loss. So, the
model can perfectly memorize the
entire 3M 10-K financial report. Um
it's extremely impressive.
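The "dumbest possible approach" he's describing is roughly this; a minimal sketch with a placeholder model and hyperparameters, not the exact experiment setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

report = open("3m_10k.txt").read()                   # the single document
batch = tok(report, return_tensors="pt", truncation=True, max_length=1024)

model.train()
for step in range(100):                              # loop until the loss is ~0
    loss = model(**batch, labels=batch["input_ids"]).loss   # next-token prediction
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# The document gets memorized perfectly, but as the poem example shows,
# the model overfits and can't answer even slightly different questions.
```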
Okay, so now let's talk to it. So, so we
did this and then we didn't want to ask
anything that's like exactly present in
the document cuz we want to see if the
model's actually good. So, we started,
you know, like everyone loves to test
poems. So, we started with a poem. We
said, "Can you write a poem about 3M in
fiscal year 2025?"
So, register your bets. And what do you
think happened?
It's terrible. Someone said it. It says,
"The passage of a passage is a poem."
End of sentence.
It's crazy.
>> [laughter]
>> Yeah. So, now maybe we ask like why does
this happen and and how do we fix it?
So, unfortunately this doesn't work and
I actually think this is like one of the
reasons why people haven't been doing
this yet is because the dumbest possible
approach usually does work in machine
learning, but in this case we have to do
something a little bit more
sophisticated. Um
So, maybe take a second and think about
like what you would do when you're
facing this problem at work or in a side
project. Um I think there's like two
things we need to fix. One is that um
the data is not exactly what we want to
train on, I think. And two is that we
probably don't want to update the entire
model because what we did there was
basically overwrite all the, you know,
stuff about Tajikistan and everything
else that's in the model with just like
this 3M knowledge. And I think that's
like too specific and then the model is
just obsessed with 3M and it'll only
produce exact copy sentences from the
document. That's that's clearly too
much. So, I think we need a better way
to update the model and we need a better
way to change the data.
Um there's this pretty relevant work. I
don't know if you follow this like LLM
chat thing from Andrej Karpathy. Shout
out. I think it's very educational. And
he had a really good question, which is
like he built this small LLM and trained
it from scratch and everything. And then
he wanted to teach it about himself. And
okay, maybe the first thing you would
try is rag. You put like a little
database of information about yourself,
but that's only scalable to a certain
amount and then the model can't really
like combine things. It can only
kind of regurgitate facts. And so he
wants to actually teach it
properly, he says, meaning in weights.
And so notice he doesn't just like take
one example and and train the model
using next token prediction. He does
something a bit more complicated. He
like generates this task (you
don't have to care about the specifics),
but basically he makes a
diverse training data set of examples
that look like the thing he cares about
and then trains on it. And if you go,
you can find this it actually does work
pretty well, which is cool. So he's able
to teach a novel behavior to a model by
like generating a lot of synthetic data
that looks like the example he cares
about and then fine-tuning the model for
a little bit, and it learns.
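The recipe, in sketch form (hypothetical prompt and helpers, not Karpathy's actual script): use a strong LLM to expand a small document set into a large, diverse synthetic training set, then fine-tune on that instead of the raw text:

```python
def make_synthetic_dataset(documents, generate, variants_per_doc=50):
    examples = []
    for doc in documents:
        for _ in range(variants_per_doc):
            # Ask a generator LLM for Q&A pairs, paraphrases, summaries, and
            # implications, so the same facts show up in many different forms.
            examples.append(generate(
                "Write a question and answer grounded only in this text:\n" + doc
            ))
    return examples

# synthetic = make_synthetic_dataset(my_docs, call_strong_llm)  # hypothetical LLM call
# fine_tune(base_model, synthetic)   # e.g. with LoRA or prefix tuning, discussed later
```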
There's a paper that's really good
that's from last year from some folks at
Stanford called synthetic continued
pre-training and they have the same
problem. So they have like a really
small data set and they want to teach
the data set to the model without like
breaking the model essentially.
And
they have this kind of fancy way of
generating synthetic data by extracting
entities, but I think the important part
is that they take a small data set and
they generate like a very large more
diverse data set representative of the
thing that they care about. And this is
something that like breaks the whole
like conventional machine learning
paradigm. Like they only have a small
training data set, so
uh what you learn in school would tell
you that you would just like overfit and
there's nothing you can do, you just
have to go back and collect more data.
But actually because LLMs are so good
now, we can do this second thing where
we generate like a much larger training
data set. It really contains only the
like facts that were present in the
original data, but it's so large that
you can train a model on it. It's like
very strange and only recently started
working, but it does work. I'll show you
some evidence. Um the green line is what
happens when you do the dumb thing you
tried before. So you just like fine-tune
the model on the data. It actually
starts at the black line, so
[clears throat] surprisingly actually
gets worse. So it like memorizes the
data so well that it can't answer any
slightly different questions about it.
Um the thing they do, they have like two
different ways of doing it, but
basically like generating lots of
synthetic data that describes the things
in the original data set. It works very
well. Like at some scale, I guess
100 million tokens close to a billion,
they can actually outperform GPT-4 on
this data set, which is really cool. So
I think like the takeaway here is
even though you don't have a lot of
data, if you're willing to generate like
a large synthetic data set that
describes the data you have, you can
actually train a model on it and it
works really well.
There's a bunch of other papers that do
this. One is called active reading.
Um they basically ask the LLM what types
of things should we generate and then
they generate from it. There's
self-study, which is from this
cartridges paper, which is more like
question answering, like asking the
model to like quiz itself. And then
there's this rephrasing-the-web thing,
where they kind of like rephrase the
entire pre-training data set. So this
actually works at scale in kind of a
surprising way. Um and there's a lot
more work in this direction. So I'm
really excited about this like and I'm
kind of monitoring it. There's a company
called Datology that's doing this really
well. They're like generating really
high quality synthetic data. It's just
like not something that used to be
possible until very recently when LLMs
crossed some threshold that they're like
able to generate data that's good enough
to actually train themselves on. Oh,
there's actually something pretty cool
that's not in the slides called
self-adapting language models.
Self-edit. It's called SEAL, S E A L,
and they
ask the model what data to generate to
make itself better. And under some like
constraint scenarios, this is actually
working. So that's like actually quite
bizarre.
And like obviously doesn't work
infinitely or else they would have
caused an intelligence explosion. But
the fact that it works at all is like
really remarkable and I think like worth
monitoring. So
in conclusion for this section, we want
to train things into weights. We can
generate large synthetic data sets that
describe pretty small data sets and
it works fine.
Um
now I think the money question here is
like how do we inject the information
into the model? I think before I
mentioned we were training all the
parameters and we tried it and it worked
really badly, and this is a problem
that's been around for a long time. It's
called like catastrophic forgetting.
Um even in old school machine learning
like you train a model to
recognize handwritten digits and then
you train a model to recognize house
numbers and it's no longer able to
recognize handwritten digits. This is
like a very well-known problem. There's
a lot of like theory and like approaches
proposed to solve it, but no one really
knows how to solve it. It's very very
hard. Um
but I think there are some easy ways we
can get around it in the conventional
paradigm where we have like this big
pre-trained GPT transformer.
Uh instead of retraining the entire
model, there's a few different ways we
can do it. I mean the first one
is retraining the entire model. So the
things we're training I'm highlighting
in blue here. That's like if we take our
transformer and we update all the
parameters,
we're probably going to forget stuff.
Um there's another one that's pretty
cool called prefix tuning where you just
train the KV cache. Um I mean we all
like skip the details for now, but ask
me if you have questions. Prefix
tuning's cool. Um another way is since a
lot of these models are mixture-of-experts
models and they have this MLP
layer in them, you can add another
expert to the MLP that is optionally
routed to and used and that's like
pretty scalable. I think people try
this.
Um
there's another approach where, instead
of like another MLP, you
build this thing called a memory layer,
which is like a big lookup table. I
think memory layers are really good. And
let me pause and say now this part of
the talk is getting close to purely
speculative. This is like the things
that are like they exist and like
someone's going to do this and someone's
going to use like one of them, but I
really don't know what the right answer
is. Um another one is called LoRA, so
low-rank adaptation. You probably heard
of this very like hot topic. Um they
kind of like train a small matrix or a
few small matrices to adapt
the linear layers. So it's like if your
model's 10 billion parameters, maybe you
train 10 million parameters that can
like control it. Um
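A minimal LoRA sketch (illustrative, not a production implementation): freeze the pretrained linear layer and learn a low-rank update on top of it:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # ~65k trainable parameters next to ~16.8M frozen ones
```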
and if we look at them together, maybe
it's not super obvious which thing would
work best. Like ICL is just like putting
stuff in context. So we have in context,
RAG, full fine-tuning. We could do the
memory layers and MLP, cartridges, which
is prefix tuning. And we can do LoRA.
You could also add something to the
mixture of experts. I think to me it's not
like clear and I'm not positive that it
matters which one we do. Like I think
the main thing is like we have this
giant model and we're adding a tiny bit
to it to control it and train only those
parameters. That way we retain most of
the information in the model. I think
that's like the most important part.
But I think for the end of this talk,
I'll just talk through like
what I think people are doing in this
space up to like the minute and then
you can make up your own mind what you
think the right way to do it is.
So let's talk for a second about what
properties we want. I think we want um
we want our changes to the model to be
very small. Like say you're serving a
model to each person, you actually can
do it, but you have to use one of these
like parameter efficient methods. If
you're trying to fine-tune a new Kimi
for each person, Kimi's like a
terabyte, it's a trillion parameters.
It's just like not even storable, let
alone servable. Um we want something
that's resistant to forgetting like we
said, so it would be nice to have
an architectural change that's both
small and makes the minimal impact on
the model as it is now because the model
as it is now works really well. Um and
preferably high capacity. I think like
changes that are really expressive and
can capture a lot of facts in few
parameters are the ones that we prefer.
And we want to be able to do inference
quickly. As like a small aside, you
actually can do this quickly with a lot
of um
a lot of these methods. Like maybe some
of you have seen Tinker, this new
training API from Thinking Machines.
It's basically all predicated on this
idea that you can
you can serve one model per person as
long as you do LoRA and batch the LoRAs.
And there's like it's actually most
interesting from a consistency
perspective. There's like ways you can
train it and train each one separately
and there's ways you can do inference
and it basically has no cost, um which
is really interesting just cuz like the
base model doesn't change and we all
share the same base model. So all the
ideas I'm going to be talk about are
kind of like in the same direction as
Tinker. Um
we can think about like whether certain
methods might learn more or forget more.
Um
so this is comparing LoRA to full
fine-tuning. So LoRA makes a tiny change
to the model. Full fine-tuning updates
the entire model. And on two different
settings, they show like LoRA here is
like purplish or pink. The pink one's a
little bit smaller capacity. Um it
basically doesn't do as well at least
when you're doing SFT.
Uh LoRA can learn a little bit less,
but also if we look at how much it's
degrading, it forgets less. So this
paper's called learn LoRA learns less
and forgets less. And it's it's actually
very nice finding. So like if you want
to at least teach a model via SFT
and you use one of these low-rank or
parameter efficient methods like all the
ones I described, they're going to make
a small change to the model in a way
that it's probably not going to be as
expressive as full fine-tuning, but it
also doesn't destroy a lot of the
knowledge. Um
here's something going in the exact
opposite direction. This is the result
from Thinking Machines showing that they
think LoRA is about as good as full
fine-tuning, which is interesting
because they're doing RL. So it's like
maybe dependent on the training
mechanism, like if you do RL, maybe it
makes small updates and
um
you can do LoRA, you can do memory
layers, but for SFT, it really has to
store a lot of information, so you
really have to do full fine-tuning. I
think that's the takeaway I have, and I
actually have a paper that's kind of
blocked for legal reasons, but coming
out soon. Um here's one result from my
paper that's relevant to this. So, we
have this like tiny LoRA thing that's
even smaller than LoRA. And well,
there's actually LoRA XS which already
exists, and then we made tiny LoRA which
is even smaller. And if you're doing RL
on GSM8K
math reasoning, [clears throat]
you can train
14 parameters and get like 91% accuracy,
which is pretty crazy. I think um
there's like a lot of reasons for this.
Like RL makes really tiny changes. I
think this QLoRA model like is something
fishy is going on with the training
data. Do you have a one parameter
experiment? Uh yeah, yeah, it's just one
parameter. It actually learns It gets 5%
better with one parameter.
>> [laughter]
>> Pretty cool.
>> Yeah, yeah, it's it's it's really nice.
I think um
Literally the smallest. Yeah, yeah, the
smallest thing you can possibly train.
It's more like you you generate a lot of
random projections, and then you control
them all with one number, if that makes
sense.
Like the model actually changes a lot,
but the only thing you can actually
train and store is the one parameter.
Uh tell me more about it later.
>> Yeah.
Um yeah, it's pretty cool. Um
This is another result that's like kind
of in the mix, but I'm not sure how to
place it. So, if you do the KV cache
tuning or prefix tuning, this paper
thinks prefix tuning works much better
than LoRA. I met some people at Meta um
when I used to be affiliated there that
said that they think LoRA works much
better than prefix tuning. So, I really
don't know, but I think like what it
really will come down to is like when
you do it at scale, what's like most
efficient. And I'm not exactly sure, but
I think prefix tuning is a pretty good
candidate because like KV caches are so
commonly used these days and like a lot
of the system stuff is built around KV
caches. I think a cool thing about
Thinking Machines is like they're
designing this entire organization
around like scaling LoRA, which is
awesome, but it's not really possible in
open source right now. Like there's not
kernels for training many LoRAs at the
same time. It's like very complex, and
you have to have a lot of people working
on that. Prefix tuning on the other
hand, it's like very well supported.
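A sketch of the prefix-tuning / train-the-KV-cache idea (illustrative shapes, not a specific library's API): learn a handful of key/value vectors per layer, prepend them to every attention computation, and keep the base model frozen:

```python
import torch
import torch.nn as nn

num_layers, num_heads, head_dim, prefix_len = 32, 32, 128, 64

class LearnedPrefix(nn.Module):
    def __init__(self):
        super().__init__()
        # The only trainable parameters: a small learned "document" that every
        # query token can attend to, one key/value block per layer.
        self.keys = nn.Parameter(0.02 * torch.randn(num_layers, num_heads, prefix_len, head_dim))
        self.values = nn.Parameter(0.02 * torch.randn(num_layers, num_heads, prefix_len, head_dim))

    def extend_cache(self, layer_idx, k, v):
        # k, v: (batch, num_heads, seq_len, head_dim) from the frozen model.
        batch = k.shape[0]
        pk = self.keys[layer_idx].unsqueeze(0).expand(batch, -1, -1, -1)
        pv = self.values[layer_idx].unsqueeze(0).expand(batch, -1, -1, -1)
        return torch.cat([pk, k], dim=2), torch.cat([pv, v], dim=2)
```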
Um and then finally, I'll quickly talk
about memory layers. This is another
approach to injecting data into models
that I think is good. This is like uh
adding an expert to the MLP, but the
expert is just like this giant
differentiable lookup table. So, it's
kind of
not that important exactly how it works,
but it's like it's just a different way
to inject information into models. The
cool thing about memory layers is it's
controllable. So, in this work uh by
Justin Lin from this year, they specify
exactly which parts of the memory layer
get updated and keep it to like a very
small number. And so, their result shows
that memory layers actually work the
best. The axes here are
forgetting (down is bad) and
learning (right is good). So, the memory
layers basically don't forget at all,
and they learn close to as much. So, I
think if you're trying to
inject information into models and you
really care about them not forgetting
any of their base information, maybe
memory layers are the way to go. I think
honestly there's a lot of conflicting
evidence right now. Like some people
think LoRA is good, some people think
prefix tuning is good, these people
think memory layers is good. I really am
not sure, but I think it's going to be
one of them.
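For reference, a simplified memory-layer sketch (illustrative, not the cited papers' exact design): a big table of key/value slots queried with a soft top-k lookup, so only the few retrieved slots get gradient, which is part of why forgetting stays low:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model=1024, num_slots=100_000, topk=32):
        super().__init__()
        self.keys = nn.Parameter(0.02 * torch.randn(num_slots, d_model))
        self.values = nn.Parameter(torch.zeros(num_slots, d_model))
        self.topk = topk

    def forward(self, x):                            # x: (batch, d_model)
        scores = x @ self.keys.T                     # compare against every slot
        top_scores, idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)      # (batch, topk)
        retrieved = self.values[idx]                 # (batch, topk, d_model)
        return x + (weights.unsqueeze(-1) * retrieved).sum(dim=1)
```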
Okay, cool. That's the end of the
training stuff into weights part. Maybe
actually I'll stop and see if anyone has
any questions about the different
parameterizations. Yeah.
Can you go back to the
slide where you were showing the
when
GRPO
Oh yeah, yeah, yeah. From my yet
unreleased research.
So, have you used SFT before?
Yeah, yeah, I can show you the SFT
results later, but SFT
uh
takes a lot more parameters, is the short
explanation. Like many, many more, like
a thousand X more or something. And you
attribute that to the sparsity of the
reward? Yeah, yeah, I think it's
something like that. Like the SFT
learning signal is like cross-entropy on
all of the tokens with or without
thinking tokens, and that's a lot of
bits essentially. And then RL just gives
you a one or a zero. If you get it right
and you already knew it, that's no
information. If you get it wrong, you
get like one bit. So, I think because RL
is like so sparse and
uh information efficient, then you can
do it with way fewer parameters. That's
That's kind of the takeaway from our
paper, actually.
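A rough back-of-the-envelope for that asymmetry (my framing, not a result from the paper): SFT's cross-entropy supervises every token, while a binary reward supervises the whole rollout at once,

$$\text{SFT: up to } T \cdot \log_2 |V| \text{ bits of signal per example (} T \text{ tokens, vocabulary } V\text{),} \qquad \text{RL with a 0/1 reward: at most } 1 \text{ bit per rollout,}$$

which makes it plausible that RL can get away with far fewer trainable parameters.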
>> So, you didn't do GRPO after doing SFT?
No, no SFT.
We just either do GRPO or SFT. And then
we see like kind of how many parameters
you need to train to get to equivalent
performance. And SFT requires many more
parameters.
Yeah.
Uh so, here you are comparing training
versus RAG. We want to solve the problems
we are facing with RAG. So, does the
volume of documents also matter? Do you
have any studies on that? Because if
some problem has a small number of
documents, will RAG be better or will
training be better? That's a really
good point. Um, maybe let's
go to the last slide. So, I think the
question is like, okay, if you're trying
to train all of your data into a model,
but something only happens once. Yeah,
meaning, when should I focus on
RAG and when should I focus on
training? Because if
every time I only have a small set
of documents, training might not
be feasible.
Yes, yes, like maybe something is so
underrepresented in your data that it
probably wouldn't...
>> Or the data is frequently changing, maybe.
Yeah, your data is changing a lot. Yeah, maybe
in the short term it's hard to train.
Um
Yeah, so let me point out like Okay, so
obviously we're always going to put
stuff into context, and I think we'll
also probably always do RAG. Like I
think um there's basically no scenario
that you can imagine for a long time
where you're just like always training
the model and never doing RAG. I think
you'll do both. I think like maybe if
you have a ton of documents, I don't
know, maybe every day you do this big
training, and then every time you start,
you also do RAG. And so, like what I
really imagine is like...
or maybe my point is that no one is
doing this right now. And like people
will start doing it.
>> Do you have any prediction, like
after a certain amount of data,
training will be more efficient
[cough] than RAG? Yeah, yeah,
yeah, no, that's a really good question.
Uh no, like I think I think this kind of
thing is really new, so there's a lot of
room for analysis like that. I would
definitely be interested to see both
analysis on how the frequency of
information affects like the tradeoff
and how just like how much data you have
to have for training to become
economically feasible. That's a really
good question.
Yeah. Um is your suggestion kind of in
uh diving more into like the weights
side of uh the presentation to use a
fine-tuned model for like
completion type tasks or also for
embeddings?
Oh yeah, that's a good question. Um
No, I think I think
the fine-tuning I'm talking about is all
for like assistant agent completion. Um
it's an interesting question. You
probably could do like dynamic embedding
model training, but I guess like the way
I think about it is like
the real like 10X improvement here is
going to come from training into
weights. You can maybe make RAG like
2X better if you really, really work,
but I think there's so many fundamental
problems with it that I wouldn't spend
that much time on making embeddings
better.
What do you feel like the most
fundamental problem is, even if
your retrieval is fantastic?
I think like chunking, like um
you just like kind of retrieve some of
the stuff you need, and then you can't
really reason across all of it. And like
I think in the limit like there's some
types of data where like no matter how
you chunk, you'll never get like
everything you need, if that makes
sense. Yeah, totally.
Cool.
Yeah.
Do you see any fundamental limitations
as you scale up the amount of
personalization you need? Let's say you
had a B2C product that had 100 million
or 10 million users, with memory for all
of them. Mhm. Do you think that's just not
feasible? You said 10 million users?
Yeah, 10 million, 100 million, somewhere
in that range. Yeah, um
No, no, I actually think it is it is
feasible. Like LoRA, maybe you train
a a few megabytes per user or something.
It's not that crazy, right? Like YouTube
probably has gigabytes
Right, that's a good point.
[clears throat] Like the continual
updates are hard. Like probably in
realistic short term, it's more like you
update once a day or something like
that. But I think that's
that's doable, but you make a good point
that the paradigm I'm describing is much
more expensive than
Also, do you consider that there's a lot more
you can do in the other two
buckets? You can compress the data
in context, you can compress it before you
put it in RAG, break it down. And within
those buckets you don't just have to use
RAG; you can use SQL and knowledge graphs, all of
them together, to solve
the problem.
Yeah, yeah, that's a good point. There's
kind of like three axes of optimization
here, and I guess like we're
getting pretty good at this. We're okay
at this, and we're horrible at this. And
so, like we'll continue improving upon
all three axes.
Yeah. What's your
like kind of hearing that maybe it's not
just fine-tuning, but what's your kind
of like intuition or guess in terms of
like where the decision boundary is in
terms of investing your effort in those
optimizations. Particularly in like
let's say a couple of years where you
could do something like a deep research
but it would be way cheaper and way
faster.
You were saying that there isn't like a
set number of documents, but what is the
boundary that you would think about
looking at? Is it the freshness of the
data, or how fast it's changing, or
the number of documents, or the cost?
Yeah, it's a really
good question. I think um
I think the paradigm I'm describing is
especially effective when you have like
a large amount of data that's not been
indexed into the LLM at all and it gives
you a big benefit there. I think when
you start seeing like
sparser updates to your data set or like
some new data that comes in but it's not
that much and it's like fairly often,
then you probably want to turn to
inference time approaches that are
closer to deep research.
Um yeah, and that guy had a question
long
Yeah. Can you elaborate a little bit
more about the
synthetic data generation? So let's say
that you have an LLM and you need to
get it to talk uh similar to the
language and terminology of like a
proprietary field, right? Like millions
of new documents.
Like how would synthetic data generation
in that context be helpful?
So your company has millions of
documents, you said, and you want the model
to... It's more like a scenario. Yeah,
yeah. Okay. Yeah, yeah, yeah. Cuz you said you
wouldn't just train off of next-word
prediction, and I think one of the
approaches you talked about was
synthetic data.
Yeah, yeah. No, I think I think
synthetic data generation could work for
that problem. So I guess like
um
it depends on how information dense your
data is. If you have millions of
documents from your company, I would
guess many of them share formatting and
only contribute maybe like a few bits of
kind of global information to the data
set. And so what you want to think about
is like does there exist a function that
could produce a good training data set
for an LLM that would teach it about my
data? And like there probably is. Like
you could probably design some strategy
that looks at the documents kind of like
figures out what's new about each
document and creates like kind of
question answer pairs.
But this is very blue sky. Like I think
a lot of people are working on this
right now, but I don't have like a
global answer of how to actually do it.
>> Right now my only solution that I can
think of is um
you know, getting it to generate those
Q&A pairs for you. Right. And
then for other
documents I'm wondering if there are other
ways.
Yeah, yeah. I think it also depends on
what types of questions you'll be asking
about the documents. Like what you
really want to model is like all
possible questions or something like
that. But I think Q&A gets you pretty
far.
Yeah.
Um so with this approach, right?
You mentioned this example where
you would train your model on 3M
quarterly earnings, right? And you
can take 10-K and 10-Q documents.
What would the prompt basically look
like? Is there
anything within the in-context
learning that would still need to be
specified to
bring your data into the context?
Yeah, uh so I think the question was if
you start with the 3M example we had and
you train all that into a model using
something like magic synthetic data,
what is actually the prompt look like?
Yeah. I think actually if you do it
right, you don't need a prompt at all.
Like you can just ask the model a
question, no system prompt, no
extra information and if nothing has
changed, it should know everything.
And there are even some scenarios
where there's only one document and the
model knows which document it is so you
don't have to specify that you're even
asking a question about the document.
It's like implied, you know? So um
it depends on how you set it up but I
think in like the ideal case, there's no
prompt at all.
Yeah.
It's not obvious to me that information
is best stored in model weights. Yeah.
Why do you... do you have a reason for that?
Um, it feels implied.
You might be right. Good
question.
So he said it's not obvious that
information needs to be stored in
weights. Yeah, yeah. This is this is a
good question. I think um
I'm not saying that it's best to store
information in weights. I guess I'm
arguing that that gets you a lot and
we're not using it right now.
>> Yeah. And like once you get to the scale
of like a GitHub repo, you might have
millions of tokens and it's just like
very expensive. And so at least like
this is the cheapest way to do it. The
question of like can we generate
synthetic data to do better than
in-context is like it's it's hard, I
think. It's like that's research.
Do you know what I mean when I say it's
cheaper though?
Like if you have a million token prompt,
you can just like compress it into the
weights and produce a model that gives
the same outputs with no prompt. And
then the inference costs less.
We can talk after.
Follow up. Yeah.
Mm. That's actually a really good question. I'd never thought of that before. I think it's probably pretty hard. If you're training on user data and you have some user who wants to sabotage your system, and you're generating training data from their inputs, there probably are a lot of these security risks. I guess in the scenario where you're only serving that same model back to the user who said it doesn't work anymore, that's not really your problem. But once you start aggregating information across users, I bet it becomes hard. I'm sure ChatGPT has the same problem, where some people always click thumbs down instead of thumbs up to try to...
>> [laughter]
>> They segment it geographically by country, since some cultures are in conflict.
>> Oh yeah.
>> So you can apply some bias in the...
>> [laughter]
>> That's funny. Yeah.
Yeah, so speaking maybe a little bit about practical limitations of something like this, especially in terms of, say, version control: you mentioned GitHub models that you can keep fine-tuning over time. Say you're a company that just changed a policy by one line, from we honor something to we do not honor it anymore, and that keeps going back and forth. Do you then start from the base model again and fine-tune that, or the one that already has a good representation and just has to change that one small thing? And then, you know, how does that interact with hallucinations? Which is kind of why we were doing full context, right? Partially to avoid that.
Yeah. So his question was about what you do once you start making multiple updates to the model, especially when you have conflicting information. I think the optimal synthetic data strategy would somehow figure this out during training, and maybe even, if there are some documents from a few days ago that are no longer relevant, you could just delete them. But I don't know how to do it.
How can we give more weight to our data when the information is conflicting, say the pre-trained knowledge versus the documents we're giving for training? If they contradict each other, I want more preference for my document, like what we do in RAG, where we ask the questions against the ground truth. So how would that scenario be handled with training?
I'm [clears throat] not sure I understood your question.
Sorry? I don't know if I understood your question.
Okay.
So my question is: the training data we are giving contradicts the pre-training data. They conflict. Now, at inference time, I want to give more preference to my data; I don't need the pre-training information. That's why we use RAG, right? I need the output from my ground truth, from whatever context I'm giving. So how can we achieve that with training?
I think the paradigm I'm proposing has all the same limitations as RAG. I'm not positive that answers your question, but for example, in the scenario he mentioned, where something was said many times and then turns out not to be true, RAG would retrieve that, and in the dumbest setup it would also be present a lot in the training data. So I think the same problems have to be solved.
Have you done any work with federated tuning? Fine-tuning parameters when your problem might be millions of users. Have you done any research in this space yet?
No, no. Not really, but I think it's an interesting opportunity. So
like back in the day, a lot of people
were really excited about the idea that
you could share gradients and train the
same model across many machines. This is
federated learning. And I think like one
of the problems why it's hard is because
the models now are so big that the
network costs are way too high. And
because like I'm arguing that you only
need to train a million parameters
instead of a trillion, it probably comes
back into play. So I think it's a very
good idea, especially in the RL world, where you do a lot of work for a long time and then apply gradient updates only seldom. So I think it probably will come back, and it's smart to think about it, but it hasn't quite come back yet.
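The bandwidth point can be sketched as plain federated averaging over adapter weights only: each user's machine fine-tunes a small adapter locally and ships back just those few million parameters, never the full model. The function bodies below are placeholders for whatever local training you would actually run; nothing here reflects a system described in the talk.

```python
# Sketch: FedAvg over small adapter tensors. Only the adapter (a few MB)
# crosses the network each round, not the trillion-parameter base model.
import torch

def local_update(adapter: dict[str, torch.Tensor], user_data) -> dict[str, torch.Tensor]:
    """Placeholder: each client runs a few gradient steps on its own data."""
    # ... local fine-tuning on user_data would go here ...
    return {name: p.clone() for name, p in adapter.items()}

def federated_round(
    global_adapter: dict[str, torch.Tensor], clients: list
) -> dict[str, torch.Tensor]:
    """One round of federated averaging over the adapter weights only."""
    updates = [local_update(global_adapter, data) for data in clients]
    return {
        name: torch.stack([u[name] for u in updates]).mean(dim=0)
        for name in global_adapter
    }
```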
Um, maybe I'll take two more questions. Yeah, go.
So your argument here about training in information seems to be counter to Karpathy's view of a reasoning engine, like distilling just the pure intelligence aspect of the model down to a 2 billion parameter thing. And I think there's a bit of overlap there: a lawyer doesn't have the entire legal code memorized, but they know how to use the tools available to them to find what they need. So I think part of it is kind of a combination of those two things, where you're doing task-specific training with something like this on a relatively small reasoning brain, to give it a sense of where it needs to find the things that might become stale. Am I on the right track here, or
Yeah, yeah. So, I think
you're making a comparison between some
people who have said, "Oh, the best
model we could ever have is like really
small and knows nothing, but can use
tools really well or something like
that." And
I guess I was proposing some similar
ideas. I said models know way too much.
I think everyone agrees. The model
doesn't need to know the capital of the
smallest province in Tajikistan for most
use cases at least in like my life. It
doesn't need to remember, you know,
encryption keys. Yeah, but I think
there's I I think it's a very
philosophical question, but
I think it's really hard to create a
model that doesn't know anything. And
so, I'm more advocating for like
specialized models that are good at
something you care about, but bad at
other things rather than advocating for
a model that's like bad at everything.
Uh okay, last question here. Yeah.
Have you ever done any research into the temporal elements of the information?
No, but I think that's one of the first things to think about: if you have information from day one and day two and day three, do you just sort of train on all the data at once, or do you train on it in order, kind of like you were asking, or do you train multiple models and merge them? I actually don't know, but that's a good segue. So now I'm working on problems related to this, thinking about this a lot.
Started a company with a few other
people and um
this is like the kind of research we're
doing. If anyone knows someone who lives
in San Francisco and is good at
engineering and you think they're
interested in this, let me know or send
me an email. Or if you're interested in
like using this kind of thing, send me
an email. That would be great.
Is it temporal stuff, or...
Not necessarily. I mean, it's kind of all of this, I would say. Um, trying to build models that you can teach things to.
Tell us more.
All right, thanks so much for having me.
This was great.
>> [applause]
[music]