Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

Channel: aiDotEngineer

Published at: 2026-05-04

YouTube video id: UsB70Tf5zcE

Source: https://www.youtube.com/watch?v=UsB70Tf5zcE

Thank you very much for joining this workshop. This is going to be a hands-on workshop, so hopefully you can get your hands dirty on a very interesting project. To start with, a little bit about myself: my name is Angelos, and I lead the speech-to-text team at ElevenLabs. I'm a research engineer, so I spend most of my time training new models and working on inference, and I'm also on the product side, so unfortunately I'm responsible for talking to clients sometimes, which I don't fully enjoy. I'm the kind of person that really likes to go straight into research, train models, and make things that are state-of-the-art and very powerful. Currently I'm working on training real-time transcription models, specifically for agents. If you don't know it already, we have this Scribe v2 model that my team trained, which is currently the best transcription model on the market in terms of popular public benchmarks. So if you ever have any transcription use cases, feel free to use it; I think it's quite good.
Now, as for the workshop itself: today we're going to be training an LLM from scratch. No pre-trained weights, nothing you can just grab online from something like the transformers library. We're going to work purely in torch and some very basic libraries. We could go one level below and not use torch, but I don't want to torch you that much. I think torch is a good level, and this will be a good indication of how research engineers in big labs actually design their models; whatever goes further than that is mostly optimizations, making things scale better and bigger and improving them for specific use cases. So what you're going to be doing today is pretty much 80% of the way to creating a model from scratch.

If you scan this QR code, you're going to find the GitHub repo; I'm going to switch to that now. We'll leave it up for a little bit longer, and then we can go into the actual repo itself. You have two options. One is to train the model locally on your laptop: if you have something like 16 GB of memory, you should be able to do it. It's a very tiny model that you can train fast. If not, and I'm guessing because not many people have power outlets, Google Colab would be another option. Google Colab gives you free GPUs that you can use for training such a small model. So it will be your choice. I'm going to leave this on for a little bit longer.
To give you some idea of the inspiration for this project: my first exposure to transformers in general was through this video from Andrej Karpathy, one of the co-founders of OpenAI, called nanoGPT. For me that was a very inspirational project, and it's essentially what inspired this workshop. It's a bit lower level than what we're going to be doing now; it goes a bit deeper into how you use NumPy for the calculations, but I think it's a good blueprint we can follow to create a model from scratch. Let me move this to the screen.
But yeah, if you don't know this project, I think it's a great introduction. If you do know it and you've trained LLMs like that before, this might seem a bit simple for you, but there are ways we can expand this workshop toward something a little more of a competition, and I'd like to see what you can come up with. For this specific workshop, we're going to work with a very small model based on the GPT-2 architecture. It's a bit of an older architecture, but its fundamental parts basically haven't changed too much, and we're going to go over those in a bit.

There are four building blocks you need to train a model. The first one is the tokenizer. Depending on what your use case is, you'd want to use a different tokenizer for that specific use case. If you want to train, for example, a very big model that generates text in multiple languages, you'll need a huge tokenizer, which means you're going to need a huge amount of data to train it as well. But for a smaller model, a smaller tokenizer with smaller embeddings is what works best and trains fastest when you're data-limited, which is what we are right now.

Next is the model architecture. To be honest, most models, at least in the period we're going to be working in, were very similar: they were just causal decoder-only models that had a very similar way of using causal self-attention, the same MLP layers, the same layer norms, and all that kind of stuff, which we'll get to. So if you know how to do this for small models, it's very easy to go through the same process for bigger, newer models. Of course, the newer models are going to be way more specialized for longer context; they're architected to scale training to as many tokens as possible, which we won't need in this case.

Third is the training loop, and this is generally the most important part when you're training a new model. If you check the difference between GPT-4, GPT-4o, and GPT-5, or even models before that, what you'd see is that the pre-training is usually very similar. It's the fine-tuning and post-training, essentially what you do with the same or a very similar base model and the way you train it, that actually makes the big difference in performance. Now we see, for example, Gemini 3 come out with very good benchmarks, and then 3.1 comes out with double the performance on some benchmarks, which is crazy. Obviously the models are very similar, but during training they trained the new model in smarter ways to improve performance very substantially.
And lastly, of course, there's the inference part. That's going to be very easy for us, because it's a small model that can run everywhere, so that will be a very simple part of the building blocks. As for the prerequisites: any laptop with at least 16 GB of RAM will do. You can work with smaller laptops as well; it'll just be a bit slower. On bigger laptops you can crank the batch sizes up higher so it will train faster. You'll want Python 3.12, and I expect that most of you have some idea of how to write Python; if you don't, you can still just copy-paste things until something works, or ask for some help. This training code supports Apple silicon (the MPS backend), CUDA, and CPU, so basically everything is supported.
And as for getting started: just to make things easier, I'm using uv for this project. So if you have your laptop out, you can install uv on your machine if you haven't; it's quite straightforward. The reason for uv is that it's quite simple: you can just run uv sync and it creates a virtual env for you and makes your life easier. We're going to write the code in a scratchpad, or in Google Colab, which, to be honest, if your internet is not very good, is maybe the better idea. Let me actually create a new one on a separate screen so we can follow along. So if you open Google Colab and create a new Colab project, you can run this command, which just installs what we need: torch, numpy, tqdm, and tiktoken. tiktoken is mostly for testing things out.
Now we're going to move on to the first part of this workshop, which is the tokenizer. As I mentioned, this is generally the first thing you think about when you create any new transformer model: what tokenizer you're using. I come from the voice world, and there it's one of the most important things. We're thinking, okay, we need to train this new TTS model, and we're going to spend maybe six months thinking about the tokenizer and then two months on the architecture. So that's generally one of the most important parts of deciding how to create a new transformer. For those who don't know what a tokenizer is: LLMs don't see text; they work with embeddings, or vectors. So we need some kind of representation in those vectors for the model to be able to process.

What we're going to use here is character-level tokenization, just because it has the lowest number of possible tokens. In our case, on our dataset, it's going to be only 65 embeddings, because 65 different characters appear in our training data.
As for the way it works: we're going to use the Shakespeare dataset, just a few works of Shakespeare. It's part of the repo itself, but if you're using Colab, there's going to be a link on how to download it a little bit later. Essentially we're going to use this stoi mapping, which basically just converts strings to integers, and those integers will then be turned into embeddings through the embedding layer when we train the LLM. It's a very simple, straightforward tokenizer: it uses the enumerate function from Python to build a dictionary and then looks up each character in it.
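As a rough sketch, here's what that character-level tokenizer looks like in code. The stoi/itos names follow the nanoGPT convention, and the file path is my assumption about where the repo keeps the data:

```python
# Character-level tokenizer sketch (stoi = string-to-int, itos = int-to-string).
with open("data/shakespeare.txt", "r", encoding="utf-8") as f:  # path is an assumption
    text = f.read()

chars = sorted(set(text))   # the ~65 unique characters in the corpus
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(decode(encode("To be, or not to be")))  # round-trips back to the input
```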
So yeah, as I said before, we use character level because it's much easier to train. Because we have only 65 tokens, the number of bigram combinations is 65 × 65, so 4,225 possible bigrams. A bigram is essentially one token and then the token you predict right after it. This concept of a bigram is very important when you're training transformers, because you want your model to see as many of the possible bigrams as possible. So if you have a model with, let's say, 200,000 tokens, you need on the order of 200,000 squared tokens of data to be able to train it from scratch in a very good way; at least that's the magnitude you're looking for. In our case, 4,225 bigrams is very doable: this dataset very likely includes all bigrams, multiple times.
If we did try to train using a full tokenizer, this would never converge; we could train it for hours and hours and our model would never get good results. Now, the problem with character-level tokenizers is that they don't really scale very well, because the way models work is that they need to understand the correlations between different tokens. You can very easily have a correlation like 'the sky is blue': those tokens combined together make a lot of sense. But 'sky', then 'is', then 'bl' is a bit harder for the model to attend to in a good way. So this will work quite well for our example, but if you want a very, very good model, one, it will be expensive to train, because of course you have to create a ton of tokens during inference and during training, and two, it will just never converge to something good, because the tokens combined don't make much sense.
So that's our trade-off, but it's a trade-off we're willing to take, because of course we're running a small model. In the future, if you want to expand this to something better, if you want to train a proper LLM and you're happy to train for a week or use bigger GPUs, use a proper tokenizer. Byte pair encoding (BPE) is the most common way of doing tokenizers these days. The way you train a BPE tokenizer is that you take all your training data, find all the common character patterns, and combine those common patterns into specific tokens that you can then reuse and whose relationships the model can understand.
The way this connects to the model itself is that we have this embedding table, whose size is the vocab size, and it does what I said before: it takes a list of integers and returns a list of vectors, which are going to be our embeddings. And like I said before, if we did use such a big tokenizer, the embedding table alone would be more than twice the size of our model. Our embedding size is 384 for the model we're going to train, so the table is already about 25,000 parameters (65 × 384). If we used GPT-2's vocab, which is 50,000 tokens, that would be about 19 million parameters, more than three times the whole model, which wouldn't make sense in our case.

So now, moving on to the next part of this workshop: the transformer itself. As I mentioned before, transformers have been kind of commoditized now. Different labs find different optimizations, but the optimizations, in principle, are about taking this base idea that works really well and making it train faster and support bigger context; people find ways that are more optimizations than reinventing the wheel, though of course there are some hybrid approaches that are more complicated. So I'm not going to go too deep into how transformers work, partly to prove a point: you don't necessarily need to know at a very deep level how transformers work to be able to train something like this. When I first did this project myself, I had no clue how transformers worked, and I still didn't know that much at the end of it. But the more you work on it, and the more motivation you have to continue pushing through, the more you understand how all the different concepts fit together and the reasons why they ended up the way they are.
To go back to the big picture: transformers use these four building blocks. One is multi-head self-attention. Attention is what makes transformers different from other neural networks: they can actually attend to previous tokens and understand the relationships between tokens that I was mentioning before. And of course, the bigger your attention is, the more the model understands those relationships. Going back to what I said, that's what big labs like the Gemini team are trying to do with one-million-token context, and they have to find new ways, because if you just tried to use a one-million context on a model like this, it would break very easily; the math wouldn't work. That's where the researchers behind Gemini found ways to make it work, and that's what makes the difference, but again, fundamentally it's the same architecture.

The next one is the MLP, or the feed-forward network, which essentially takes the different relationships between those tokens that attention has arranged and combines them together, so that the model can generate the logits and then generate the tokens. It takes the context and organizes it in ways the model can use. I hope that was a decent explanation.

Then you have the residual connections, which are basically there so the model doesn't have to reinvent itself after every layer. As we'll talk about later, a transformer is built of multiple layers that we pass the activations through one after the other, and residuals are there so that each layer doesn't completely restart things from scratch; it just changes them slightly. It takes the previous input and adds a small difference, the next layer does the same, and so on. This way no single layer makes a huge change to the inputs, and the model can be more stable during training.

And lastly, layer normalization has a very similar role: it scales down the activations so they don't explode into very big values. If one layer multiplies your activations by 10x, let's say, the layer norm pushes them back to normal values, so you don't get 10x times 10x times 10x, where values that start at 0.5 end up at 10 million. That's what the layer norm is for.
But again, these are just the building blocks. You don't necessarily need to know why they're there and what their purpose is; you learn this as you start working on these models more and understand why these decisions were made, because all of them, of course, came from someone having a certain idea, it not working, and adding something to make it work. Sorry, go ahead.

>> [Inaudible audience remark]

Yes. So, I was going through the tokenization; that was my previous point. And you'll have to go back and forth through the slides when you're working on the model yourself, because you'll have to copy-paste the parts or figure them out yourself. But it was explaining why we're going with character-level tokenization and what other options we have. Then for the transformer, I was explaining, in the big picture, the different blocks of the architecture.
And now let's see how this looks in code, because for all the things I just described, there's actually very little code you need to implement them. First we start with the basics of... sorry, go ahead.

>> A question about tokenization when it comes to Python. In English the vocabulary is more or less fixed, but in Python you have your own variables and so on. How does tokenization work in that setting? What would be a token?

>> So, in a programming language you of course have syntax keywords, but then you also have normal variables, function names, and so on; when it comes to tokenization, how does that work? Yeah, as I mentioned before, we're going to use a character-level tokenizer for this project. But most big labs don't use character-level tokenization. They use what I was mentioning earlier, the BPE tokenizer. Essentially, what they do is look at the training data. Let's say you have some trillions of tokens, and your training data includes code, as you mentioned. The tokenizer training just looks at common patterns. So if a lot of your training data is code, you'll realize that, say, for-loops seem to be a good candidate, so 'for' is going to be a token for sure, and then maybe 'enumerate', which you see quite commonly in the training data, will be another token. You look at all these different patterns and create the tokenizer based on how common they are. Of course, some languages like Pascal might not be very common in the data, so Pascal's keywords might not get their own tokens, although I do think there's probably good representation in tokenizers even for those. But some of those crazy whitespace-only languages probably aren't going to work very well.
>> My question was more on... one thing is the keywords of the programming language, but then you have your own variable names, which are not common across different programs.

>> I'll get to that in a moment.

>> Making them tokens, I'm not sure how that helps.

>> So again, there's no specific rule, no human in the loop in this process; it depends on the training data you use to train the tokenizer. If your code uses variables like foo and bar, which are common names, those will probably become tokens. If your variable names are very strange, like random characters, then yeah, they're probably not going to be in the tokenizer. And when this happens, the tokenizer falls back to character-level or byte-level tokenization. So if your identifier is just random gibberish, it gets encoded character by character, or maybe some combinations end up together, but most of it becomes separate tokens, and that's a bit of a pain for inference. Not the best if you want efficient inference.
But yeah, going back to the transformer side of things. As I said, the models are generally quite similar, and the code for actually implementing these transformers is also quite simple and easy to write: maybe 100 lines, usually less than that. The first thing we have to decide is what the parameters of this transformer should look like. As I mentioned before, vocab size is the size of your tokenizer; in our case that will be 65. Block size is essentially the sequence length, or the context window. In our case that will be 256, which is very, very small for these models, but because we're training a model locally, that's kind of what we have to do. Bigger labs would use up to a one-million context size for this; 16,000 is a common middle ground. Then we have the layers: as I mentioned before, transformers have multiple layers, and you run activations through each one of them. We're going to go with a modest number of six. Then the attention heads: the way attention works, you have different heads that attend to different things. One attention head might be looking at punctuation, another might be looking at grammar. All these different attention heads attend to a specific feature of the text, or, if you're working with audio, a specific feature of the audio, and so on. And lastly, the embedding dimension: that's how big the actual vectors you create for those tokens are. In our case, we're going to start with 384, which is the standard for GPT-2. A bigger value would carry more information per token, a smaller value less, but 384 is a pretty standard value to start with.
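A minimal sketch of that config as a dataclass, using the values just described (field names follow the nanoGPT convention, not necessarily the workshop repo's):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 65   # characters in the Shakespeare corpus
    block_size: int = 256  # sequence length / context window
    n_layer: int = 6       # number of transformer blocks
    n_head: int = 6        # attention heads per block
    n_embd: int = 384      # embedding dimension
```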
As for the code itself, feel free to copy-paste parts of it if you want to spend more time understanding it; I don't want to go too deep into all the little details of how things work. But essentially, you generally have an overarching module, which in this case we're going to call GPT, and that top-level module includes all the other modules I described earlier. It takes the config we had above, and we create the submodules using a torch ModuleDict here, just because it's the easiest way to implement this. But this is all just math: everything you see here is calculations through matrix multiplications, which torch lets us abstract away to make things easier, and pretty much everything here is neural networks, smaller or bigger, combined together.
>> A question: [inaudible] ... related to the number of parameters?

>> Not necessarily. Sorry, maybe you can repeat the question with a microphone.

>> So the size of the input sequence is related to the number of parameters in your model?

>> Not necessarily. You can have large sequences with small models, or small sequences with bigger models. The main thing this shows is that there are essentially two ways you can train a model. You can train a full-attention model that looks at the whole past, which is what we're building here today. Or you can have a windowed model, which is what most of the bigger models out right now are: they only look this far into the past.

>> And this parameter is the size of that window? Okay.

>> Yes, it's what we're going to use for training. The model will never have seen anything longer than 256 tokens in sequence. If you try to go above that, it will have issues.
>> But let's say we have the computational capacity to increase the number of parameters to help the model reason better. Which number would you crank up here? Would you increase the number of layers, the attention heads, or...?

>> I would increase all of what you see here, except maybe the embedding dimension; I don't remember exactly what other models do, but that number seems reasonable to me. Everything else here works well for a small model. A 256 block size is tiny: if you tried to run this as ChatGPT, it would just forget things you wrote ten sentences before. But the bigger you make this, the harder the training is. That's what I was talking about with scaling laws: you can't just go here and say, actually, I don't want 256, I want a 2-million context length. You can't train that, at least with this current architecture. After GPT-3.5 came out, researchers were saying, okay, people are complaining that we only have 16k context; we want 1 million. How do we do it? You can't just change this number; you'd instantly go out of memory. You have to change the architecture to allow training at that scale. So that's one of the things researchers now try to figure out: how do you increase these numbers while keeping training stable, and still have enough compute to be able to do it? I hope this answers your question.

>> Awesome. Thank you.
So yeah, the first thing we have to do, as I mentioned before, is create the top-level representation of our model, which you can see here as the GPT class. I'm not going to go too much into the details again, but one part is the embeddings: how your model represents tokens as embeddings, and also how it represents each embedding's position. I don't want to go into details, but essentially the tokens need to carry both. Then you have all the layers I was mentioning earlier, each of them configured as a Block; we'll get to the definition of Block a little later. Essentially, a Block is a layer of the transformer that includes its own attention and its own layer norms, and the blocks feed into each other as you do the forward pass. And lastly, you have the LM head. The LM head takes the outputs of all of the above and connects them into what we call logits, which is the distribution over what the next generated token should be. Because, as I mentioned earlier (or maybe I didn't), the way transformers work is that they predict the next token: you take the previous context, predict the next token, and sample that token from a distribution. The LM head creates the distribution we sample from.
Then the next important bit, which again is quite similar in most of these models, is the forward pass, which takes the input, the tokens we mentioned, and pushes it through all the components I described earlier. It does some pre-processing, turns tokens into embeddings and positional embeddings, adds them together, then goes through all the blocks of the transformer, all the different layers, running the forward pass of each, then does the final layer norm and passes through the LM head to get the distribution of logits we mentioned earlier. If you're doing training, this also computes the cross-entropy loss, and at the end you get logits and loss as the outputs of your transformer model. I have some diagrams here that you can see; they're quite basic, but they show the flow: take token IDs, turn them into token embeddings and positional embeddings, add them together, pass them through the transformer, then through a layer norm that puts the outputs in a space the LM head can comprehend more easily, and then return the distribution over what the next token should be, which has this size. As I mentioned before, 65 is the size of our vocab.
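Here's a minimal sketch of that top-level module and its forward pass, assuming the GPTConfig above and the Block class sketched further down; the workshop repo's actual code may differ in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    """Top-level model: embeddings -> blocks -> final layer norm -> LM head."""
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # positional embeddings
        self.blocks = nn.ModuleList(Block(config) for _ in range(config.n_layer))
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)       # token info plus position info
        for block in self.blocks:
            x = block(x)                        # run every transformer layer
        x = self.ln_f(x)
        logits = self.lm_head(x)                # (B, T, vocab_size) distribution
        loss = None
        if targets is not None:                 # training: next-token cross-entropy
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```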
Now, self-attention is a little bit more complicated. I don't want to go too deep into how it works, but essentially attention is there to understand the relationships between tokens: what is important. If I say 'the sky is blue', 'blue' and 'sky' have a very big correlation. That's what attention does: based on how you've trained your weights, the model learns which tokens should attend to each other and puts higher emphasis on those specific relationships. Going back to what I was saying before, this is why the tokenizer matters a lot: 'sky' and 'blue' as whole tokens make it very easy for the model to form this relationship, while in our case it has to combine different groups of characters together, which is quite a bit harder. But that's what attention does: it works out what each token should attend to in the past and what has the most importance. And it has a forward pass too, as all of these blocks do. In your implementation, feel free to just copy-paste these blocks; maybe spend some more time understanding the diagrams and how things work. But as you can see, even something as complicated as attention takes just a few lines of code to implement.
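A sketch of causal self-attention in that style, using PyTorch 2's built-in scaled_dot_product_attention for the masked attention math (nanoGPT builds the causal mask by hand; the idea is identical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each token only attends to the past."""
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # one linear layer produces query, key and value for all heads at once
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)  # output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # split the embedding into n_head smaller heads: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True masks out future tokens, so attention only looks back
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back
        return self.c_proj(y)
```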
And lastly, the MLP block. I already talked about why attention has multiple heads: each head attends to a different part of what makes language what it is. The MLP block then takes the outputs of attention and makes sense of all those different relationships, turning them into something the model can understand a bit better. Again, it's essentially just a small neural network that combines things into a representation that's more usable for the LM head when it generates the distribution over logits.
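A sketch of that MLP block; the 4x expansion (384 to 1,536 and back) is the standard GPT-2 convention:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward block: expand 4x, nonlinearity, project back down."""
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # 384 -> 1536
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # 1536 -> 384

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```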
>> Yeah, no, you can interrupt.

>> We have the same transformer blocks... is it the same architecture between the different layers?

>> No, each block has its own weights. Each layer is a Block; the weights usually get a different prefix per layer, and the MLP is often called an FFN block instead of MLP, but it's the same thing. So each block has its own weights.

>> And each block has its own MLP?

>> Yes, each block has its own MLP. As you can see here, each block has its own attention, its own layer norms, and its own MLP. That's what basically makes up a layer of a transformer.
And that's what my point was going to be later: everything comes together into this Block you can see here, which is basically one layer of a transformer. It has some normalization, runs the attention to get the relationships between the tokens, and then has the MLP that takes those relationships and turns them into a representation that makes it easy for the model to create the logits. And... oops. This is what I was showing; sorry, I have two different screens, I shouldn't be doing that. But yes, this is the transformer block, the building block of a transformer. It has a layer norm, the attention, and another layer norm. Not all models have exactly this configuration, but this specific one does. And then the MLP takes everything and creates a representation that makes sense for us to be able to make the generations. And you can see here a bit more of a simple diagram of how this works.
>> Yes?

>> The original nanoGPT had the residual connection as well, remember?

>> The residual should be there.

>> No, no, I mean in Karpathy's nanoGPT, did it have it?

>> I don't remember, it's been a while. I took the idea from there, but I wrote everything from scratch, so I don't really remember if he did or not. But the residuals are there, and as you can see here, the idea of a residual is that instead of doing x equals the attention output, you do x equals x plus the attention output, so the values of the activations don't change that much. That's the idea of the residuals.
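Putting that into code, here's a sketch of one transformer layer with exactly the x = x + ... residual pattern just described (pre-norm, as in GPT-2):

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer layer: pre-norm attention and MLP, each with a residual."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual: add a small correction
        x = x + self.mlp(self.ln_2(x))   # rather than replacing x outright
        return x
```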
And yeah, if you want to implement this yourself, you can copy-paste all these different classes into one file that you can call model.py, and that's essentially all the math of how the transformer works. As I mentioned before, we then have to decide how big the transformer is. The parameter count you can see here is about 10 million parameters, based on what I was showing above. You can work it out yourself by summing the parameters of all the different parts of the blocks we had above.
The token embeddings are 65 × 384 (we said 384 is the vector size per embedding), so about 25k parameters. The positional embeddings are 256 × 384, where 256, remember, is the max sequence length of the model; that's another 98k parameters. Then the biggest part of the logic is in the actual transformer blocks, where most of the parameters live. For attention you have 4 × 384 × 384: the four comes from the way attention works, where you have query, key, and value projections plus an output projection, each 384 × 384. That's about 590k for the attention per layer. For the MLP the logic is similar: the 1,536 is the MLP's hidden dimension, 4 × 384, the standard expansion factor in GPT-2 style models, and the two projections (384 to 1,536 and 1,536 back to 384) come out to about 1.2 million. So each layer is about 1.8 million parameters, and with six layers plus the embeddings, the total we're going to be training is roughly 10 million parameters, which should be easy to train on most devices.
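You can reproduce that arithmetic in a few lines (layer norms, biases, and the LM head are left out here, which is why the result is approximate):

```python
n_embd, vocab, block_size, n_layer = 384, 65, 256, 6

tok_emb   = vocab * n_embd             # 24,960  (~25k)
pos_emb   = block_size * n_embd        # 98,304  (~98k)
attn      = 4 * n_embd * n_embd        # 589,824 per layer: Q, K, V + output projection
mlp       = 2 * n_embd * (4 * n_embd)  # 1,179,648 per layer: 384->1536 and 1536->384
per_layer = attn + mlp                 # ~1.8M per layer

total = tok_emb + pos_emb + n_layer * per_layer
print(f"{total:,}")                    # 10,740,096 -> roughly 10M parameters
```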
So yeah, you can go back to this slide later. It has a lot of details that will make sense once you go back, understand things a bit deeper, and do your own research. But this is the general, very high-level idea of how these transformers work.
Now we're going to move on to the training loop, and that's where most of the meat of this project is: we have this transformer, and we need to train it to do what we want. The objective of the training we're doing today should be two things. One, it should be easily recognizable, so you know when the model starts working, which means it understood what it's trying to do. Two, it should be something that's quite easy for the model to learn. If you tried to teach the model to write Python like a competitive programmer, that's a very hard task; we're not going to be able to do that here. In our case, the objective is going to be to create a Shakespearean LLM that can generate verses in the style of Shakespeare.
And as we mentioned before, these models learn by next-token prediction. The way the cross-entropy setup works is that you take the current tokens you want to train on, say t0, t1, ..., tn, and you want to predict t1, t2, ..., tn+1. So you take your sequence and just offset it by one; that's what the model needs to learn to predict. Of course you don't have a target for the last position, so you have to cut it. The model then learns how to predict the next token based on this logic.
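In code, the offset-by-one trick looks like this (the token values here are made up just to show the shapes):

```python
import torch
import torch.nn.functional as F

tokens = torch.tensor([10, 4, 7, 9, 2])  # hypothetical token ids from the corpus
x = tokens[:-1]  # model input:  t0 t1 t2 t3
y = tokens[1:]   # targets:      t1 t2 t3 t4  (same sequence shifted by one)

# in training the logits come from the model; random values here just for shape
logits = torch.randn(len(x), 65)  # (sequence length, vocab size)
loss = F.cross_entropy(logits, y)
print(loss.item())
```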
Now for the actual training code itself. First we need to load the data, and we have this function that loads it. The data is already in the repo, in the data directory; it's a collection of different lines and verses from Shakespeare, about a million characters. We first load the data. The data loading is very simple, too simple, so it's one of the things you could optimize. Essentially it takes all the tokens, splits them into a validation set and a training set, shuffles, and takes sequences of 256 tokens from the text to use for training. The batch size is going to be 64, so it takes 64 of those 256-token sequences, stacks them together, and passes them to the model to train on. So, a very simple data loader. Usually these data loaders can be quite complex, because especially at higher context lengths, the way you load data is a very big part of how the model learns. But in our case we have a very, very simple implementation.
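A sketch of that data loader, assuming data is the whole corpus already encoded as one long tensor of token ids:

```python
import torch

def get_batch(data: torch.Tensor, block_size: int = 256, batch_size: int = 64):
    """Sample random contiguous windows from the token stream."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # shifted targets
    return x, y
```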
Next up is the device. This code works on MPS, CUDA, and CPU, depending on what your laptop supports, or you can run things on Google Colab. If you run on Google Colab, it will detect CUDA, and that will be quite fast. MPS is also quite fast. CPU is the slowest, but it would still work decently well.
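The device selection boils down to a few lines like this:

```python
import torch

# pick the fastest available backend: CUDA (e.g. Colab's T4), then Apple's MPS, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```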
Next up is the learning rate, which is a big part of how models are able to learn. Generally you start with as high a learning rate as you can afford without the model becoming unstable, which would basically make it go crazy. The learning rate is essentially how much the model learns per step: how far the weights move in the direction you want. If the learning rate is too high, you overshoot your target and your model can go off the rails very fast, so you want an appropriate learning rate for the model to train well. Usually you also have this concept of a warm-up, where you start with a very small learning rate so that the weights can settle into places that are appropriate for training to begin. So you start at a very low learning rate, increase it until you reach the peak, and then from the peak you start reducing the learning rate again; that's the learning rate decay. You keep decaying until you reach a point you're satisfied with. For some people that's zero; I prefer not to go all the way to zero, because then it's hard to restart the training, but that's the idea. You want the learning rate to be small when the model is close to being good: you want it to start with very big changes, to find good local or global minima, and then calibrate as training goes further.
So we'll start with a small warm-up of about 100 steps and then use a cosine decay until the max number of steps, which in this case is going to be 5,000. So we start very low, peak at 100 steps, and then decay down to step 5,000. We control the learning rate per step with this cosine schedule, and the updates themselves are applied by the AdamW optimizer, which is the most common optimizer people use, or at least used to use. There are better optimizers now, but this is the simplest to start with.
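A sketch of that schedule as a function of the step; the peak and floor learning rates here are my own placeholder values, since the talk doesn't pin them down:

```python
import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=100, max_steps=5000):
    """Linear warmup to the peak, then cosine decay down to a floor."""
    if step < warmup:
        return max_lr * (step + 1) / warmup            # warmup: ramp up from ~0
    progress = (step - warmup) / (max_steps - warmup)  # 0.0 at the peak, 1.0 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```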
And here's how the full training loop looks: again, not that much code. You initialize your config, in this case six layers, six attention heads, 384 embedding size, and 256 sequence length. You initialize your model, then create your optimizer and start your steps. tqdm helps with tracking the losses and all that kind of stuff. You want your loss to start high and keep going down until it reaches acceptable levels.
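Gluing the earlier sketches together, the core of train.py could look roughly like this (GPTConfig, GPT, get_lr, get_batch, and device are from the sketches above; train_data is assumed to be the encoded corpus):

```python
import torch
from tqdm import tqdm

config = GPTConfig()
model = GPT(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is overwritten per step

for step in tqdm(range(5000)):
    for g in optimizer.param_groups:
        g["lr"] = get_lr(step)                    # warmup + cosine schedule
    x, y = get_batch(train_data)
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)                    # forward pass + cross-entropy
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                               # backward pass: compute gradients
    optimizer.step()                              # move the weights a little
```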
Another important part is evaluation, to make sure your model actually works well; that's what the val loss is here. It's very easy for models to overfit, especially small models like this, because you don't have that much data. So the training loss might keep going down, which means what the model predicts and what the training data says are very similar; your loss can keep going down a lot. But at some point your model may overfit, and when it overfits, even though the loss keeps going down, the actual performance of the model gets worse. That's why we have this val loss: it's computed on a part of the data the model has never seen. We run a forward pass to get the loss on that specific part of the dataset, and if it's very low, it means the model is genuinely performing well, because the model has never seen this data, so it can't have memorized things it never saw. So that's what we have here with the val loss.
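A sketch of that evaluation, assuming a val_data tensor split off before training:

```python
import torch

@torch.no_grad()
def estimate_val_loss(model, val_data, iters=20):
    """Average loss on held-out data the model has never trained on."""
    model.eval()
    losses = []
    for _ in range(iters):
        x, y = get_batch(val_data)
        _, loss = model(x.to(device), y.to(device))
        losses.append(loss.item())
    model.train()
    return sum(losses) / len(losses)
```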
I'm not going to explain the concept of the backward pass in depth, but essentially it's how the model weights move in the direction of being optimized. For each step we have a batch of 256-token sequences with batch size 64; we push this matrix through our model, it learns, the optimizer takes a step, and the learning rate is adjusted depending on which step you're on. And lastly, every 1,000 steps or so you save a checkpoint, to be able to restart your training from that point if you need to, and we also run inference on the current checkpoint to see what the model actually predicts at that point. So here's what we're going to start seeing. Because this is a model we're training from scratch, the loss starts out essentially random, and in that case it will be the natural log of 65, so it will start at around 4.17.
That basically means the model knows nothing; it has no clue what this data is. Slowly we'll start seeing the loss go down to 3.3; that's when the model starts to understand character frequencies. It still won't be able to form words, but it might pick up things like 'th', as part of common words like 'the', and start generating that. Then at around 2.5 it gets a little better at this 'th' business and starts to get words like 'in' and so on. At about 1.5 to 2.0 it starts actually creating words, and at around 1.0 to 1.2 the model starts being decent at the task: you'll actually be able to recognize names from the text, and it starts producing things that make sense. But when the loss goes below 1.0 on this specific dataset, that's where we start seeing overfitting. The model will still produce reasonable things, but it will no longer be getting better. Here's an example at 200 steps, from when I was testing this: it was just producing complete nonsense.
The val loss at that point was around 3.5. Then at about 800 steps it started producing decent things, still not great, but it was starting to get there, and at 1,000 steps it got better. Then there was a point where the val loss actually started increasing instead of decreasing, and that's how we know the model overfit. So around 2,400 steps was the optimal performance of this model. If we kept going, the performance wasn't necessarily decreasing, but the model started becoming less creative.
That's one of the things to keep in mind. Now, val loss is not the best metric. If you're being serious about training LLMs, you'd usually have some benchmarks running as part of your training, and you can see whether the benchmarks are getting worse or not. But for us it's a very easy and cheap way of understanding how the model is doing.
So yeah, the next step now is to actually train this model yourself. Considering the internet here is not very good, I would suggest probably using Google Colab for this. You can copy-paste things from here and glue them together; copy-pasting will probably get you 90% of the way there. There's a lot of room for improvement in what I've written here; I made it super simple on purpose, and there are things you can improve yourselves. But the idea is to get something working. I hope everybody who's interested in working on this will be able to get something working: to start from nothing and end up with a model that can actually produce a result that seems reasonable.

After we have trained a model, the next part is text generation, which is the inference side of things. There are multiple ways to do inference. One way is greedy decoding, which I was mentioning earlier: you have the logits, which are just a distribution over tokens, and you always take the most likely token; that's what greedy decoding means. So let's say the token 't' and the token 'h' are both in the distribution, one with 80% probability and the other with 15%; you'd always take the top one. That's greedy decoding. Greedy decoding doesn't work very well for LLMs. It can work well for other models, but for LLMs it essentially makes them very boring and not very creative in what they generate, so you pretty much never want to use greedy decoding for LLMs. You would want to use it for other models: for transcription, for example, greedy decoding is the best, because there's usually only one correct way to transcribe something. You don't want it to be creative in transcription; that's not a good idea.
So that's greedy decoding. In our case, we're not going to use it; we're going to use temperature sampling. Essentially, with temperature you're not always choosing the highest-probability token: sometimes you might choose the second-highest, or the third. Even though it doesn't seem to make sense (why would you choose a worse token?), it has been shown that this actually makes the model perform better. Sometimes the model might go into a weird loop, and there are techniques to make sure it doesn't go crazy generating nonsense. The worst case is when it predicts an end-of-text token and just stops the generation, which you may have seen sometimes when using ChatGPT, where it suddenly stops for no reason; sometimes that's why. But there are ways to prevent this. Generally, a temperature of around 0.7 is the best middle ground to make inference work well.

Then you have top-k sampling, which essentially prevents the model from picking wildly unlikely tokens. If, let's say, five tokens are very likely and the sixth is completely unlikely, top-k sampling prevents the model from predicting that sixth, very unlikely token, even if you got unlucky and the temperature sampling would have hit it. That's what top-k sampling does. And this is our inference function; it's very straightforward. It takes all the tokens as input, passes them through the model, takes the logits out of the model, and runs softmax (that's the name for what I was describing before): using the temperature, it gets the probabilities, and then it decides what the next token should be based on those probabilities.
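A sketch of that inference function with both temperature and top-k, assuming the GPT sketch's (logits, loss) return convention; the top_k default of 50 is my own choice, not from the talk:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=0.7, top_k=50, block_size=256):
    """Sample tokens one at a time from the model's next-token distribution."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature      # lower temp = less random
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = -float("inf")  # drop very unlikely tokens
        probs = F.softmax(logits, dim=-1)            # logits -> probabilities
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)      # append and continue
    return idx
```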
And one more thing you can do is use seeds. Right now, everything will be random if you just keep re-running generation. But if you use a set seed, then your inference will always return the same value, which will be relevant a bit later.
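Setting the seed is one line; 1337 here is just an arbitrary example value:

```python
import torch

torch.manual_seed(1337)  # any fixed seed makes the sampling reproducible
```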
Then, putting it all together: we should have three different files. One is model.py, which includes our model architecture. One is train.py, which includes the dataset loading and the training loop. And lastly, generate.py, which includes our inference. In total, this should be maybe a few hundred lines of code, very straightforward. And even with just this much code, if we had big hardware, we could train a good LLM using this architecture; with enough data and enough resources, that's essentially all you need, and that's basically what GPT-2 and GPT-3 were. I remember when OpenAI was about to release GPT-2 and they were saying they weren't going to release it because it was too dangerous for humanity. Back then, this was essentially the code they were working with, plus a lot of data and of course a bigger model. Now that seems kind of funny, but for them it was a real wow moment that they did this and the model actually performed really well on tasks. In the end, it was just what you see here.
So yeah, putting it all together: if you use Google Colab, you can use this snippet of code to download the dataset, and you can use this pip install command to install the different dependencies. Then it should look something like this: you install your dependencies, you copy your code, and you might need to wire some connections together, but essentially you can run a train command with the dataset you want to use, and this will start the training, and then you can see the performance of the model improve step by step. For me, this took about 15 minutes to train on Google Colab and start getting good results. With some improvements you can make it go faster, or maybe make it go slower but actually get better results. But essentially, it's very simple to get working. One thing to remember is that you have to change your runtime type to T4 GPU, because that's free and it will run quite fast.
Yeah, feel free to experiment. You can try starting with a very tiny model, the 0.5-million-parameter one that only has two layers, only two attention heads, and a smaller embedding size for each token, and then try bigger and bigger models until you get some usable results. And as I mentioned before, you can try different context lengths: I started with 256, you can try a bigger 512. If you want to monitor how your losses went down in a nice way, you can use this plotting snippet to see the graphs. The things to look for: if your train loss is not decreasing, your model is not learning, which probably means you have a bug in your code. If your train loss is decreasing but your val loss is actually increasing, that means you've overfit. And if you have very weird spikes in your loss (the loss should generally be very smooth), it again means there's some kind of bug, either in your data or in your training. And when the model starts plateauing and not getting any better, it means you've pretty much exhausted the usefulness of your current dataset, so either you're going to use a bigger model or you essentially need more data.
Now, one interesting part: I think it would be cool if we had some sort of competition for who can train the best model here, if anybody's able to get things working. Hopefully the internet is good enough. And for further reading, you can see a lot of the resources for this workshop.
Essentially, the challenge is this: we can all vote together to find which model actually produces the best verse of Shakespeare, or, if you use a different dataset, the best poem or something in that category. The rules are that you have to train the model yourself, here and today. You can't ask ChatGPT to give you a good verse; you have to use your own model, and to prove it you use a seed with a specific prompt and show what output it gives you. You're free to regenerate as many times as you like until you get the best result. I'll have a QR code where you can submit your results, and I can go around and help people out if you need any help getting training running. Of course, this training setup is super bare-bones; there are many ways you can optimize it and make it better. So for people who are a bit more experienced, you can implement those improvements and maybe get better outputs from your model. The winner is going to get some free swag from ElevenLabs, maybe a hoodie or some free credits; I'll see what I can actually give.
So, yeah, this is the submission. It needs to be creative, and it needs to be essentially a good verse, and we can maybe use a kind of bracket. Oh, what happened? Oh, cool. We can use a bracket and vote on which verse sounded the best. They can be funny, they can be well made, it's up to you. And the reproducibility should look something like this: you run python generate on your best checkpoint, with the prompt that you decide on, a temperature of your choice, and a seed that proves you can actually reproduce your results.
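Under the hood, the seed is what pins the sampling down. A hedged sketch of what a generate script does with it (function and flag names here are hypothetical; the repo's script will differ in details):

```python
# Hypothetical core of a generate script: same checkpoint + prompt +
# temperature + seed => the exact same verse every run.
import torch

torch.manual_seed(1337)                        # the --seed argument
# ids = encode("ROMEO:")                       # the --prompt argument
# for _ in range(200):
#     logits = model(ids)[:, -1, :] / 0.8      # the --temperature argument
#     probs = torch.softmax(logits, dim=-1)
#     ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=1)
```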
You can try different model sizes; I guess 85 million parameters would be a bit too big for this, but if you have the resources, you can try it. You can try better tokenizers: this one, of course, is character-based, but perhaps you can train your own BPE tokenizer based on that dataset. There are also other tweaks you can make, like a bigger context, and some small optimizations like using a dropout value. You can stop whenever you feel the model is good enough, change the learning rates, and essentially make the model as good as possible.
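For the BPE idea, one low-effort route is Hugging Face's tokenizers library. A sketch of training a small BPE vocabulary on the Shakespeare file (the vocabulary size here is an arbitrary choice):

```python
# Train a small BPE tokenizer on the workshop corpus (sketch).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(["input.txt"], trainer)

ids = tokenizer.encode("O Romeo, Romeo!").ids  # subword tokens, not characters
```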
Yeah, that is the idea. I'm going to go around and help you out. Sorry, go ahead.
>> [Audience question, partly inaudible, about whether reasoning models are trained differently]
>> So, to repeat the question: you're asking if reasoning models are quite a bit different in training. The building blocks are very similar. You can train the same exact model and then post-train it, which is usually how reasoning is taught to these models: you have a good base instruct model, and then you post-train it to be a reasoning model. This is very data-driven, so you need very high quality data, and you use a loss that's good enough to learn this data well. The complication of reasoning models is finding this good chain-of-thought data. That's why OpenAI has all these labelers, like PhD students, who write down the reasoning of how they solve problems. This data needs to be very high quality, because it teaches the model how to think. You can't just go on Reddit and grab random posts; you're not going to learn how to think that way, for sure. You need very high value, good quality data that teaches this reasoning process. But in the end, reasoning is essentially just adding this logic to the context of the model, to the attention, so that when it generates the response, it can go back and attend to those reasoning tokens and get a better response out. It could be, say, describing the problem a bit better; then it goes back, sees those tokens it wrote, and says, "Oh, actually I already figured this out. I'm going to write it down now."
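As a hedged illustration of what that chain-of-thought post-training data can look like when flattened into a training sequence (the tags and fields here are made-up conventions, not any lab's actual format):

```python
# Hypothetical chain-of-thought SFT example; the delimiters are invented.
example = {
    "prompt": "Q: A train leaves at 3pm travelling 60 km/h for 2 hours...",
    "reasoning": "<think>Distance = speed x time = 60 * 2 = 120.</think>",
    "answer": "A: 120 km.",
}
text = example["prompt"] + example["reasoning"] + example["answer"]
# Post-training then runs the usual next-token cross-entropy over `text`,
# so at inference time the model emits (and can attend back to) the
# reasoning tokens before committing to the final answer.
```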
The microphone is not working, is it? Yeah. Perfect. So, reasoning and non-reasoning models often share the same base, and then one is post-trained in a different way, with a different kind of post-training.
>> Yes. So a lot of the labs, take the Qwen models that were released, like Qwen 3, usually release a base version and an instruct version, and usually the base version doesn't have chain-of-thought reasoning. It's usually the same model that they first pre-train to be quite good, maybe do some fine-tuning as well, and then the next step is post-training to teach it this kind of knowledge. That's why you see a lot of big improvements happen very fast in the industry, like Gemini 3 to 3.1: it's essentially giving the model better reasoning data and better fine-tuning and post-training data for specific problems. It's a little bit like benchmarking; usually this data is very similar to what the current good benchmarks measure. But yeah, essentially it's just taking a base model and using this new data to improve it to the next level.
>> Thank you. And the second question: compared to this very barebones model you're showing us here, has there been fundamental innovation in the main training of the models powering today's systems, or is it mostly the same, just with smarter tricks, small tweaks, better data and so on? Is everyone still using the same attention layer, or have there been some fundamental shifts happening lately?
>> No, yeah, you're correct that a lot of this space is the same. They might make some changes in terms of how attention attends to those different tokens, because reasoning can be quite long, so you need big sequence lengths to get good results. So there are a lot of tricks that the labs use to make the attention more efficient. But overall, you could take this GPT-2 model and make it into a reasoning model, if you had the data and a big enough model to actually learn from that reasoning, because for these tiny models, reasoning won't help too much. But with a big enough model that can get something out of the reasoning, it works: there are people that have taken older models, say a Llama 1B, that are small and weren't trained for reasoning, and made them into reasoning models using the exact same architecture.
Yes.
>> How much effort do you put into getting your golden dataset, and what is the process that you follow?
>> You mean for post-training?
>> Yeah.
>> Yeah. So, as I mentioned before, usually what most labs do is go to companies like Scale AI; that's probably the biggest one. Scale AI has an army of people, and you can say, "I want physicists. Give me data from physicists." Then Scale AI will find contract physicists, pay them as much money as they need, and these physicists might write things down or be contracted to do different things. Essentially, a lot of these companies, like Scale AI, provide data for labs like Anthropic, although Scale AI got bought by Meta, so probably not that much anymore. These datasets are provided by labeling companies, Scale AI being one of them, but a lot of the big labs also have their own labeling teams and hire contractors to generate these datasets for them. But, as you said, if this dataset has even some small issues, it can literally make or break your model. So these datasets are kind of the most expensive; they cost tons and tons of money, but they are the ones that make the models as good as they are.
>> Just a quick follow-up. Even then, you still have to evaluate, right? The answers wouldn't be exactly the same. Are people really reviewing this, or are you relying on an LLM to evaluate?
>> You rely on people for this kind of stuff. Usually the way it works, and a big part of my job is actually doing this organization, is that you might have one person that generates the very high quality labeled data, and then you have a labeler that has graduated into a QA position, whose job is essentially to make sure the outputs of all the other, more entry-level labelers are correct. And it's quite a tough job, because if your QA is not good, you get fired. That's the way they keep the level quite high.
>> Yes. So LLMs are non-deterministic, but how does that seed parameter work?
>> The way LLMs are non-deterministic is that they essentially use the random number generators of your machine, or of the GPU that you're currently using, of the system. The seed essentially makes all those different calculations always return the same value, so things are no longer random. And this is not only for LLMs; these seeds can be used for any random generation, so it could be, for example, for password hashes; it works the same exact way.
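A tiny self-contained demonstration of that in torch:

```python
# Same seed => the RNG replays the same state => identical "random" draws.
import torch

def sample(seed):
    torch.manual_seed(seed)                 # reset the generator state
    probs = torch.tensor([0.1, 0.2, 0.7])
    return torch.multinomial(probs, num_samples=5, replacement=True)

print(torch.equal(sample(42), sample(42)))  # True
print(torch.equal(sample(42), sample(43)))  # almost certainly False
```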
>> Yes.
>> Yes. Yeah, even if you use greedy decoding, it's still good to use a seed, because there might be some other things you don't control. Sometimes greedy decoding might not be actual greedy decoding; it might be something like a 0.01 temperature, as in vLLM, for example, where they do tricks like that to get good outputs. So in general, it's good to use a seed even if you're using greedy decoding.
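For contrast, a sketch of greedy versus temperature sampling on a single step (the logits here are random stand-ins for model output):

```python
# Greedy picks the argmax; temperature sampling draws from a softened softmax.
import torch

logits = torch.randn(1, 65)                  # stand-in for final-step logits
greedy_id = logits.argmax(dim=-1)            # deterministic given the logits
probs = torch.softmax(logits / 0.8, dim=-1)  # temperature < 1 sharpens
sampled_id = torch.multinomial(probs, 1)     # stochastic unless seeded
```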
>> Yes.
>> This is obviously all in text. ElevenLabs is mostly audio, right?
>> Yes.
>> How different is this versus doing it with audio?
>> It's surprisingly very similar. It's more complicated, for sure, but parts of the stack of most audio models can also do text, because the models need to understand language itself, and the best way to teach something language is text. Of course, you can train an audio-only model, and then, again, it's like what I was saying about the tokenizer: how do you tokenize audio? What is the concept of a sound? How big should your tokenizer be for audio? Should it cover only human speech, or should it also cover music? Do you take the notes of different musical instruments as different tokens? These are the kinds of problems you have to solve if you're creating an audio model that you wouldn't necessarily need for text; text is a bit easier. But the fundamentals are still the same. If you want to generate audio, you generate an audio token, and that audio token comes from a tokenizer. It's not exactly the same, because you then have to process the audio token in some different ways to turn it into actual audio, but a lot of the same general ideas still apply.
>> But do you have a lot more loss or something? With text, a token is very clear, a one is a one, but with audio, a particular pitch or frequency or sound has ambiguity; information is already lost straight away, right?
>> Yeah. So the way it works is that you don't use this cross-entropy loss; you use different types of losses. You can use cross entropy, but usually it's losses that are more specialized for what you're trying to do. For example, there's a loss called L2 loss, which essentially takes two mel spectrograms, which are sound waves encoded in a certain way, and measures the difference between them. That's a very common way for TTS models to be trained: on these specific types of loss. And, as I was mentioning earlier, cross entropy doesn't really work as well for things like post-training; you might use different types of losses there. Or if you're distilling a big model into a smaller model, you don't use cross-entropy loss; you might use a KL-divergence loss, where you take the token distributions of the bigger model and try to match the logits of the smaller model against them. So there are different types of losses for different use cases. Not all of them work the same way.
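Hedged sketches of the two losses just mentioned, with random stand-in tensors (shapes are illustrative only):

```python
# L2/MSE on mel spectrograms (common in TTS) and KL-divergence distillation.
import torch
import torch.nn.functional as F

pred_mel, target_mel = torch.randn(80, 200), torch.randn(80, 200)
mel_loss = F.mse_loss(pred_mel, target_mel)      # "L2 loss" between spectrograms

student_logits = torch.randn(4, 65)
teacher_logits = torch.randn(4, 65)
T = 2.0                                          # temperature softens both sides
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),   # student log-probs
    F.softmax(teacher_logits / T, dim=-1),       # teacher probs to match
    reduction="batchmean",
) * (T * T)
```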
>> Yes.
>> So in a sense you have a text vocabulary and an audio vocabulary, both together?
>> So, the way multimodal models work, usually you don't use tokens in the same sense. Okay, this is where it becomes more complicated, because these models are not really built for this. The GPT-2 model, and the newer models, also have an embedding input. As I was mentioning earlier, each token corresponds to a vector, but these vectors don't have to correspond to a specific token; you can take them from other places too. And what a lot of these labs do, instead of having a tokenizer for, say, video, is to have another transformer that they call a video encoder, and they put the video through that video encoder first. Say you have a 30-second video: the encoder takes one frame per second of that video, and puts those frames through this new transformer, this encoder transformer, which works quite a bit differently. Then you take the final layer of this transformer, the hidden values, which are also vectors. You take those vectors out of the encoder and input them into the embedding layer of the transformer model. So what the model sees is usually prefixed: you take the video, push it through the encoder, get some vectors, and then put those vectors in the embedding input of your transformer that does text. If you looked at the sequence, it would probably be a prompt followed by video token placeholders, but the embedding of each video token is overridden by the output of the encoder. So that's how these multimodal models work.
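A hedged sketch of that prefixing trick with stand-in tensors; every name here is hypothetical, and real systems add projections, positional handling, and masking on top:

```python
# Prefix video-encoder outputs ahead of the text embeddings (sketch).
import torch

B, T_vid, T_txt, D = 1, 30, 16, 128      # e.g. 1 fps from a 30-second clip
video_emb = torch.randn(B, T_vid, D)     # stand-in: encoder final hidden states
text_emb = torch.randn(B, T_txt, D)      # stand-in: token embedding lookups

seq = torch.cat([video_emb, text_emb], dim=1)  # (B, T_vid + T_txt, D)
# The transformer only ever sees D-dimensional vectors, so it doesn't care
# whether a position came from text, audio, or video:
# logits = transformer(inputs_embeds=seq)
```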
>> Yes, it's exactly the same for audio: you have an audio encoder and do the same thing, but for audio. And in terms of what the model cares about, it just cares about these embeddings. It doesn't care if it's text or audio or video. It cares about these vectors, and that's how you represent everything in the same dimension the transformer expects.
>> So if you embedded a video of someone jumping and the word "jumping", would there be no similar embedding?
>> Maybe there is; I'm not sure, it depends on how you train it. These are the kinds of things that are a bit of a black box. Maybe there is, maybe there's not. Actually, that's a good idea; that would be a good research paper, to see whether video encoders actually match the text embeddings in the same dimension. I don't know, but I imagine there probably is some connection. Sorry, you had a question as well?
>> Yeah, I was wondering: you do normal speech, but you also do some music generation. Is that a very different problem, or is the architecture similar? And when you have harmonics and things like that, can you still use a basic autoregressive transformer, or do things depend more on each other, so you might generate everything at the same time?
>> You can do both. There are music models that are autoregressive and music models that are diffusers; it depends on how you train it. I think some of the Google models are transformer-based, and some open-source models are diffuser-based. Both can work very well. It's just, as I said, a little bit more difficult to fit the concept of music into tokenizing and predicting the next token; it's kind of hard because it's very abstract. So usually diffusers work a bit better in modalities like image generation, or music, or even audio for some models, which have some kind of diffuser as part of the process. Both can work; diffusers are generally a bit easier to get working. I hope this answer makes sense.
>> Yep.
>> How do you get the audio tokenizer in the first place?
>> It's very hard. And it's not really something you do by just sitting down and thinking, "Okay, I'm going to tokenize this sound to that token." It's not something you decide by hand. You use some kind of process and you train an audio tokenizer; it's a very similar case, I guess, to how you train a text tokenizer. You'd use your training data, find common patterns in the audio, and then tokenize those patterns. Of course, when we say audio, we don't always mean the actual sample rate and raw audio waves. Usually you convert them to something that's a bit easier to tokenize and process. The most common one is mel spectrograms: first you convert your audio to mel spectrograms, and then you use these arrays of numbers to train your tokenizers. And it will be very dependent on your training set. If you wanted to tokenize music and used a music dataset, your audio tokens for music are going to be very different than if you had a voice dataset focused on human voice. Now, the hard part is: what if you want to do both voice and music? That's where it gets very hard.
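For the mel-spectrogram step, torchaudio has a ready-made transform; a minimal sketch with a random stand-in waveform:

```python
# Raw waveform -> mel spectrogram, the representation tokenizers train on.
import torch
import torchaudio

waveform = torch.randn(1, 16000)   # stand-in for one second of 16 kHz audio
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(waveform)
print(mel.shape)                   # (1, 80, n_frames): an array of numbers
```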
Any more questions? Okay, awesome. Yeah, if you want, you can start working on training the model if you have a laptop out, or if you have already started, you can follow the workshop to get something working. And if we have enough submissions, we can do the competition and see who wins; the winner is going to get some nice swag.
>> What's the cutoff?
>> Sorry? Because we don't have that much time left, let's just say 5:45. So, if you have any questions or need any help, please call me and I'll come over.