VoiceVision RAG - Integrating Visual Document Intelligence with Voice Response — Suman Debnath, AWS
Channel: aiDotEngineer
Published at: 2025-12-06
YouTube video id: hwCmfThIiS4
Source: https://www.youtube.com/watch?v=hwCmfThIiS4
All right, you're almost on time. First of all, thank you so much for joining us. For the next hour or so we'll explore something I found pretty interesting when I started working on it, which is vision-based retrieval. I'll give you some background on how I ended up here, but the idea is simply to share a few of my learnings about this particular approach to retrieval. There are a bunch of things here: I'm going to walk through one of the latest research papers on vision-based retrieval, and I also thought I'd wrap the whole thing with an agent, because without an agent we cannot talk about anything these days. It's funny: I originally had this without an agent, but the organizers said we needed one, and it's not a big deal. So we'll focus mostly on the science side of this, how vision-based retrieval works, and then we'll switch gears and wrap it with an agent. That part is a very simple task, and I'm going to use an open-source framework we launched recently, I think two weeks back, called Strands Agents. It's a lightweight framework for building agentic applications. I'll talk about it a little later, and I have a session on it tomorrow.
That's the premise. Before we get started: how many of you are from the science side of things, who has worked on transformers? Okay, perfect. How many of you have worked on RAG in general? Fantastic. And how many of you have worked on AWS? Okay, great. There's nothing about AWS here, by the way; that last question was sponsored by my manager. What we're going to do is share one notebook. You can just clone the repository; there's a lot more inside it, but we're only going to use one part of it. I'm also going to share a few of the $25 credit codes I was given, which you may like to use. So let's get the logistics sorted first. Can we switch the screen, please? Take a moment and check whether the URL is working. If you're on a laptop you can open the URL, or just take a picture on your phone and have a look later. Is it working? >> Okay, perfect. You can take a picture now or do the survey later. I don't love this either, but again, it was given by my manager. It may ask you a few questions, I have no idea what they'll ask, but you'll get a $25 credit, and if you don't want to do it, don't; I'll give you the $25 credit anyway. And I don't know why this slide is next, but, oh, I actually forgot to introduce myself. I work with AWS as a principal machine learning advocate; I've been with the company for the last six months, and I focus mostly on natural language, RAG, and fine-tuning. If you have any questions about what we're going to discuss, or anything around machine learning or generative AI, feel free to ping me.
It's not just about this session. My takeaway whenever I speak at a conference of this scale is to make a few connections with people I can work with afterwards, because as far as learning is concerned we can learn everything at home; you don't have to come to a conference for that. So feel free to connect. With that, let me switch to the GitHub repository and walk you through the notebook. My idea was not to have a slide deck, first because I'm lazy and second because the material is a little involved; I thought embedding the images directly in the notebook was much easier. In the repository there are many things, but what we're going to focus on is section 8, the first item there, the agentic voice-based RAG notebook. I only added the agent part yesterday. There are two notebooks which are exactly the same, one without output and one with output. I find it useful to have both copies: if you're doing this for the first time you may like to start with the one without output and run through it, and if you want to see what the expected output looks like you can switch to the other one. For today's workshop I'll start with an introduction and then come back here. If you feel this isn't what you're interested in or looking for, feel free to move to another session; I don't want to waste your time. But if you're here for the next hour, I want to make sure you learn something new relative to what you already know. And if you have any questions, feel free to ask. Let me expand this; it's a little too big here.
I noticed most of you are aware of RAG, but let's talk about multimodal RAG for a moment, just to set the premise, and then we'll get into vision-based retrieval. In multimodal RAG, what we essentially do is this, and this is by no means the only architecture, just one of them; there are many different ways to do multimodal RAG, but in general this is what we have been doing and still do today. You take your data, which contains images, text, and tables. The first thing you do is use some framework of your choice, or write your own custom script, or use a managed OCR-based service such as Amazon Textract, to extract the images, tables, and text separately. You can keep some metadata, a hash that tells you which page each image came from and so on, but essentially you split these three modalities apart. Then you use one multimodal embedding model, and because it's multimodal it can take any of these three entities as input; when I say multimodal, think of the input being multimodal. It generates vectors for the images, vectors for the tables, vectors for the text. Then you go to any vector database and store all these embeddings.
So what you're storing here are the actual embeddings of the text, the tables, and the images. Then comes the retrieval part. When you ask a question, the raw text goes through the same embedding model, you search the database, and you get back some relevant chunks, which could again be images, text, or tables. You take those chunks along with your question and pass them to a multimodal LLM. Why multimodal? Because the relevant chunks can be images, text, or tables. And then you get an answer. That's approach number one.
The second approach starts the same way; the extraction part is common. After that you use a model that generates a summary of each item separately. The summary of an image is essentially image captioning; you generate a summary of each image, each table, each piece of text. Now all you have is summaries, which means it's all text, so you can use any text-based embedding model to embed the summaries and store those embeddings. What you're storing here is only the embeddings of the summaries, not the actual data. When a question comes, and now we're talking about option number two, you do a semantic search against the database and get back some summary chunks. A given summary could be the summary of an image, a table, or a piece of text, we don't know, but either way everything we get back is text, so we can use an ordinary text-based LLM to generate the output. That's option number two.
Option number three is exactly the same as option number two, with one small change. When you store the summaries, you also keep a hash, say a dictionary, that records: this is the summary of image number one, this is the summary of image number two, this is the summary of table number one, and so on; any data structure of your choice that lets you come back later from a given summary and figure out which entity it summarizes. You still store only the summaries, just like before. The difference from option number two is at query time: you ask a question and get back some relevant summary chunks, then you go back to that hash and find the actual data mapped against those summaries, not the summaries themselves, and pass that actual data onward. So the summaries are used only to reduce the search space for the semantic search; once you have the relevant chunks you don't care about the summaries, you take the original data from the hash, and pass it along with your question. And since the relevant chunks can again be text, images, or tables, you need a multimodal LLM to generate the answer. Are you with me?
>> Let's say you have a table, but the table is an image. Which one would you prefer to use in that case?
>> Yeah, that's a good question. The question is: what if you have a table which is actually an image? Everything you're describing happens at the extraction level, so it's entirely up to you how you segregate the three entities. Say you use an OCR-based technique and it identifies a table as an image.
Then it will be treated as an image, because up to this point the model has no idea where these three streams are coming from; the extraction is a prerequisite for this particular pipeline. Are you with me on all three approaches? Yes?
>> This one.
>> Oh, okay. These boxes are nothing but models, basically.
>> It doesn't quite resonate. It's what I understand now, but...
>> Right, that's correct. Here we have a multimodal embedding model. There, it's actually two steps: first you need a model that generates the summary, and then you can think of another model that generates the embeddings of that summary. It's drawn as one icon, but think of it as two things happening in sequence. With me so far? Yes? Okay.
So, do you see any problem with this? When you have multimodal data there's no problem as such, but there are a few scenarios where it may not work. Scenarios like the one you mentioned: we've seen documents where the PDF was created entirely from images. Think of the toll booths we cross on a highway; they just take images of our number plates. Or think of a government organization where forms are scanned as images and later all those images are bundled into a PDF. In cases like that, these techniques of extracting images, tables, and text don't always work nicely. It's not that they never work; it's all about how your data behaves with the technique you're implementing.
The next technique we're going to discuss uses a vision-based retrieval model, and we'll see why. But the premise is this: if any of these three options works with your data, just go with it, and what we're going to discuss in the next hour isn't relevant for you. What we're going to discuss is an option number four, a smarter technique based on a vision model that performs the retrieval. You don't have to extract those three entities in the first place. Think about it: the moment you get your data, the first thing you do is split these three things apart. It's like a family where the kid goes one way, you go another, and your partner goes somewhere else; if they all scatter and you expect somebody else to recognize that they're all part of one family, that's a real task for the outside observer. That's what we're going to solve: can we come up with a technique where we don't do all this splitting? Before we go on, I think you have a question.
>> Yeah, I think you kind of answered my question, because you were explaining the case of scanned PDFs and why it wouldn't quite work, and I was a little confused as to why these approaches wouldn't work. But I think you're going towards the notion that we need to establish relationships between...
>> Exactly. Absolutely. Let me give you one more example. Think of IKEA; you buy something from IKEA.
If you've seen the IKEA instructions (I've personally never read them, but the research paper I was reading refers to them, because most of us just go to YouTube and search for the assembly steps), the instruction sheets just show a little emoji-like human figure assembling something. There's no text at all. Unless you have a visual understanding of what's going on, you have no idea what they're describing. There are datasets like that, and I'll show you a few, where some text is embedded inside the image, or where there's just an image with no text at all. So you need a model, or some technique, that can help you understand the semantics of that kind of data.
Let's see how we're going to solve this. The text on this diagram might be small; you can open it on your laptop, and I'll try to explain as much as I can. This is the traditional technique we just discussed: in the first place you split the three entities apart. That's not very helpful, though, and here's why. Say you were given a book and asked to answer a particular question. By the way, this is a fantastic book by Simon; have you heard of it? If you're getting started with machine learning and deep learning, read it. It's recently published and the professor is very approachable. So say I give you this book and ask you a question, and you're not familiar with that topic. You won't scan the entire book. You'll first try to find the structure of the book, the index, the appendix and so on, figure out which chapter the question might be answered in, then go to that specific chapter and read through it. That's what a human would do, and that's exactly the philosophy here. When you get a question, you first scan the index and the appendix, figure out where exactly the question can be answered, accumulate the relevant chunks of information, and only then come up with a response.
That was the motivation behind the vision-based retrieval model called ColPali. Have you heard of this model? A few of you, okay. ColPali was introduced, I think, in July 2024, less than a year ago. The core motivation is: treat every page as an image. If you have a PDF document of, say, 100 pages, your dataset is not one PDF but 100 images. There's no concept of extracting images, text, and tables from it. How does it work? It first creates patches of every page. Consider one page, which is just one image; the same applies to every page in your document. The first thing it does is split the page into patches. In the paper, I think the model was trained with a 32 by 32 grid of patches.
In this simplified picture, how many patches do we have? One, two, three... fifteen patches. Once you have those patches, you use the ColPali embedding model and it generates one vector per patch. So for this page, how many vectors will we have? Fifteen. And if my document has ten pages, how many vectors in total? A hundred and fifty. What we'll do now is look at the middle part, how it generates those embeddings; then in the last section we'll see how it does the retrieval, and then we'll go to the code.
Before we get into the embedding process, let's take a short detour through vision language models. Have any of you worked on one? A few of you, okay. Ultimately, think about it like this: we've had language models, and I don't just mean large language models but text-based models in general, ever since we've had transformer-based architectures. At the same time we had models that worked very well with images, based on CNNs. What researchers thought was: now that we have language models, why not make use of them for vision as well? That could be images, or videos, which are really just images with a timestamp, one extra dimension. So people took a vision-based model and a text-based model. Before training, these two are completely separate; they live in different embedding spaces. The idea is to end up with a model where, if you send in an image of a dog and separately a sentence about a dog, the two vectors you get out at the end are very close to each other. Initially they won't be. If I take the sentence "a dog is sitting on a field" and run it through a text model, it generates a vector; for the sake of simplicity, say the embedding dimension is 10, so it's an array of 10 numbers. Similarly, an image of a dog produces its own final vector of 10 numbers. Those two vectors could land anywhere in the space, because before training there's no correlation between them. During training, we take lots of positive samples, where the text actually describes the image, and lots of negative samples, where the image is paired with some random text. The loss function is designed so that if a matching pair ends up similar, the loss is low, and if the vectors are orthogonal or very far apart, the loss is high. Over the course of training we optimize this, and at the end, when you send in an image or the corresponding text, the embeddings you get are very close to each other. We're not going to deep-dive into vision models, but this is what's called contrastive learning: if you feed in an image and a relevant, positive caption and the two vectors come out far apart, the loss is high, because we want those two vectors to be close.
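To make that concrete, here is a rough, simplified sketch of a CLIP-style contrastive objective in PyTorch. This is illustrative only, not ColPali's actual training code; the encoder outputs and the batch pairing are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """CLIP-style contrastive loss.

    image_emb, text_emb: (batch, dim) embeddings where row i of each tensor
    is a matching image/caption pair; every other row acts as a negative sample.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities between every image and every caption in the batch.
    logits = image_emb @ text_emb.T / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Loss is low when the diagonal (matching pairs) dominates,
    # and high when a matching image and caption end up far apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "encoder outputs" of dimension 10, as in the example above.
img = torch.randn(8, 10)
txt = torch.randn(8, 10)
print(contrastive_loss(img, txt))
```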
That is what backpropagation then optimizes: during training the weights are updated so that matching image-text pairs move closer together. This, incidentally, is one reason why, when you use any foundation model, people tell you a prompt should describe what you want, not what you don't want. If you're into prompt engineering, think about why that is. Let me give you an analogy: you go out for dinner and ask your partner what she'd like to have, and the answer is "I don't like this, I don't like that." But that wasn't the question; the question was what you do want, and that's always harder to answer. Same with prompts: "I want a dog sitting on this chair" is a nice prompt, but "the dog should not sit on the floor" can generate any image, because you haven't said where it should sit; it might end up on a desk or somewhere else. That's why we always say the prompt should be positive, about what you want rather than what you don't, and also because training datasets simply don't have that many negative samples.
Now let's come back to ColPali and how it works. You give it an image; think of this as one of the patches. It goes through a vision-based encoder, which generates an embedding, and then through a linear projection. The reason for the linear projection is that, at query time, your question will also be turned into vectors, and we want those vectors to be compatible, the same size; that's why the projection layer is added. You can simply think of it as a fully connected layer. After that you have a standard transformer, and you get the output tokens. Let me just scroll down. So with ColPali, when you give it an image with, say, 15 patches, each patch goes through this path and produces a vector; that vector is the final representation of that patch. You don't feed the patches in one by one; you give the model a full image, say page number one of your document, the model does all the patching internally, and it produces the final embedding vectors, one per patch. In the diagram, the query side is greyed out because we're talking about what happens after training: once the model is trained, you create the embeddings of your document, and while you're creating those embeddings there is no question involved, so only the image path is used. Once all those final vectors are stored for your entire document, at query time you use a text-based query. ColPali doesn't let you query with an image the way you might in ChatGPT or any GPT-style model, where we're so lazy these days that we don't even type a question, we just upload an image and let the model generate something.
So here the question always has to be text; that's the prerequisite for this model. The query goes through the same model, and the vectors it produces are what you then use for a semantic search against the patch vectors you've stored in your vector database. With me so far? Good.
One practical point: both for the query and for the document embeddings, a certain amount of preprocessing is needed, because your images can be of different sizes. Say you have a PDF document and the tool you used to convert it into images produced 800 by 800 pages, while somebody else's tool produced a different size; we need to make sure the images are brought to a standard size. That's why, when we look at the code next, you'll see that before the embeddings are generated there's always a preprocessing step.
Before we go to the code, let's talk about how it generates the similar chunks, because this is the most important part of ColPali. Imagine page number one of your document, and say the patch grid is 2 by 2, so four patches in total. Here is the embedding of the first patch, the second patch, the third, and the fourth. Now you ask a question, say "what is AI?", just for the sake of simplicity. It goes through the tokenizer and produces three embedding vectors, three tokens basically. What we do is take the dot product between each query-token vector and each patch vector, giving us a matrix. Then, for every row, we find the maximum. What does that number signify? The 89 signifies that the first token of your question has its maximum similarity with the second patch of the image. Similarly, if this one is 97, the second token of your question has its maximum similarity with the third patch. At the end we simply sum the row maxima; if that comes to 2.58, then this query has a score of 2.58 for page number one. We do the same for every page, and when we do the semantic search in RAG and ask for the top five or top ten chunks, here a chunk is nothing but a page: we get the top five pages by this score.
This is called late interaction; you may have heard of late-interaction embeddings. The reason it's called late interaction is that the patch-token embeddings are already stored; that work is done, it's sitting in your vector database. All that has to happen at query time is the dot products and this max-then-sum over the matrix to rank the top five or top three pages. With me so far? Now, this functionality is not supported by every vector database. We're going to use one that does support it, called Qdrant; have you heard of it? There are a few other databases, but I haven't done enough research into exactly which ones support it.
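Here is a minimal sketch of that late-interaction (MaxSim) scoring, using random tensors in place of real ColPali embeddings; the shapes follow the walkthrough above (query tokens by patches), and the numbers are purely illustrative.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> float:
    """Late-interaction score between one query and one page.

    query_tokens: (n_query_tokens, dim) embeddings, e.g. 3 x 128 for "what is AI?"
    page_patches: (n_patches, dim) embeddings, e.g. 4 x 128 for a 2 x 2 patch grid.
    """
    sim = query_tokens @ page_patches.T        # dot product of every token with every patch
    best_per_token = sim.max(dim=1).values     # for each row (token), keep the best-matching patch
    return best_per_token.sum().item()         # sum of row maxima = page score (the "2.58")

# Score every page of a 10-page document and keep the top 5, exactly like limit=5 later on.
query = torch.randn(3, 128)
pages = [torch.randn(4, 128) for _ in range(10)]
scores = [maxsim_score(query, p) for p in pages]
top5 = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:5]
print(top5)
```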
That said, this MaxSim calculation is not supported by every database out of the box. There are open-source contributions for a few of the vector databases; I tried OpenSearch and it didn't have it natively, though I think there's an extension you can use to get this functionality.
So now let's get into the demo. As I said, once you have those scores, like the 2.58, you'll have one for every page in your document, and at the end you can pick the top three or top four pages of your choice. Notice that so far we're not talking about agents at all; that's a very simple task, and we'll wrap this with an agent later on. Let's try to do this. I'll come back to this image later; let me just increase the font. Can you see this? You don't have to read all of it, but you should have an idea of what we're doing. First we import a few libraries; I won't go through every import, but there are a few. Where is ColPali... yes, this is the ColPali model we'll use, and this is the Qdrant database, which we're going to run locally in a Docker container. If you're planning to run this, make sure you have Docker installed on your laptop; I think the README has all the information.
First we need some data. I've used a dataset which is basically a small textbook, a science textbook, chapter 13. One interesting thing: if you look at this page, there's an image with no text in it at all, so if you ask anything about this image using a traditional technique it might not answer properly. This is another image that comes along with some text. You can pick any dataset of your choice, but for the purpose of this demo you may like to download one of these PDFs from the URL in the notebook and play around with it. Then you need a Hugging Face token, because we're going to download the model from Hugging Face. And you should not do what I'm showing here, hard-coding the token; I was just trying it without creating a .env file, but you should have a .env file and your token should live inside it. The token you see on screen is a dummy, not my real one. So here we're loading the token and logging in to our Hugging Face account. Next we check whether we have a CPU, a GPU, or MPS; in this case it's a MacBook, so I'm using MPS as the device. Since it's a vision-based model it's better to run it on a GPU, it will be faster, but you can perfectly well run it on a CPU; that's fine. Just be a little cautious if you're running it on the CPU of your own laptop.
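As a rough sketch of that setup, assuming the colpali-engine package (class names and arguments may differ slightly from the notebook), a .env file holding an HF_TOKEN variable, and the vidore/colpali-v1.3 checkpoint:

```python
import os
import torch
from dotenv import load_dotenv                                  # pip install python-dotenv
from huggingface_hub import login
from colpali_engine.models import ColPali, ColPaliProcessor     # pip install colpali-engine

# Log in to Hugging Face with a token kept in a .env file, never hard-coded.
load_dotenv()
login(token=os.environ["HF_TOKEN"])

# Pick the best available device: CUDA GPU, Apple MPS, or CPU as a fallback.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model_name = "vidore/colpali-v1.3"
# cache_dir keeps the weights locally so they are not re-downloaded on every run.
model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
    device_map=device,
    cache_dir="./model_cache",
).eval()
processor = ColPaliProcessor.from_pretrained(model_name, cache_dir="./model_cache")
```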
About running this on your own laptop: if it's an office laptop, no one cares, but if it's your personal laptop, make sure the batch size is very small, otherwise it will crash. In fact, when I first ran this I didn't check the processing time; it just kept crashing and eventually rebooted my laptop. I didn't even read through all of this, I raised an IT ticket, and I actually got a new laptop. It was my fault, but they decided my work needed a machine with more memory. So if you're looking for tricks to get a new laptop from your company, this is the cell; I'll tell you what to change: just increase the batch size to 12 and it should work fine.
Okay, this is the model we're going to use, ColPali version 1.3. There might be a newer version by now; I checked last month and it was still 1.3. I load a model and a preprocessor; remember we said we need a preprocessor first, we preprocess our data and then use the model to generate the embeddings. It's the same model family, but there's a processor object and a model object, both coming from Hugging Face, and we use a cache directory so the model is stored locally and doesn't download from the internet every time you run it.
Once that's done, you need a vector database. If you have Docker installed you can just copy and paste this command; it creates a container with port forwarding, and a folder gets created locally as storage, so all your vectors are stored locally on your laptop. That's all. If you click on the dashboard you should be able to see its UI, and under collections, since I've already executed the code, you can see one, but initially you should see nothing; as you run through the notebook the collection will appear. How many of you are familiar with databases? Many of you, okay. A collection is basically something you can think of as a database, where you store the schema and the vectors. I'm creating a Qdrant client, which is what I imported earlier, pointing at localhost and this port; that's just the setup. So now we have a vector database, we have the data, and we've downloaded the model. The next thing is to create a collection. Here we have a collection called class 10 science; you can give it any name. Here we specify the vector size, the embedding length, which is 128, and what this code essentially does is check whether the collection already exists; if it does, it won't create a new one, otherwise it creates it. Yes, you have a question?
>> Yeah.
>> Let me ask you this: what do you think would happen if I increased the embedding size from 128 to 256? How would it behave? Just a guess.
>> Mhm.
>> Okay, let me give you an analogy.
Say I've just walked in here and told you only two things about myself: I work for Amazon and I'm married. That's all the information you have. Now if he asks you a question about me, say "does Suman play cricket?", will you be able to answer? Not really; you'd answer based on those two facts, but it would be random. Now say I give you more information: I'm Suman, I work for Amazon, I'm married, I have one wife as of now, I have one kid, and a few other things. The more features I give you about myself, the richer the information you have, so if someone asks a question you're more likely to give an accurate answer. Same here: increasing the embedding length isn't about chunks or anything like that, it's about how much granular information you carry about a specific thing. You could always embed any entity with a single number, a vector of size one, but it wouldn't carry much information; as you increase the length, it becomes richer. Coming back to your question: in the documentation I think they said 128 is a good number, but you can always use 256 if the vector database and the embedding model support it.
So this cell is where we create that collection; we haven't even started creating embeddings yet. And look here: this is what I was referring to when I said Qdrant supports that late-interaction scoring. I'm setting a multi-vector configuration with the comparator set to MaxSim. MaxSim is what picks the row maxima out of that matrix, adds them up, and gives us the final score between your query and each page, because at the end of the day what we want is the relevant pages for our question.
Once this is done, it's pretty simple. I have to create the embeddings, but before that I need to convert my data into images, and that's what this function does. It takes a directory; you can have hundreds of PDF files in it, and it goes through all of them and creates an image for each page. Not only that, it appends everything into a list called all_images, which is just my own housekeeping with some metadata: document ID, page number, and the actual image as RGB. It stores them in a local directory called pdf_data. If you look at the first two entries you'll see document ID zero (I have just one PDF, so all entries have document ID zero), page number zero with its image, page number one with its image, and so on. So this data structure contains everything. With me so far? Great. Now that I have these images, I can use the embedding model to generate the embeddings, and this is where I crashed my laptop: I initially used a batch size of 12, it took a lot of memory, and I had only about 16 GB, so it crashed. If you're trying this on your laptop, start with a batch size of two or three. The batch size basically means how many images you process at a time.
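Putting the indexing side together, here is a compact sketch of the collection setup, the PDF-to-image conversion, and the batched embedding and upsert that the next cells perform. It assumes the model, processor, and device from the earlier sketch, a local Qdrant started via Docker on port 6333, and pdf2image (which needs poppler installed); the file paths, collection name, and payload fields are illustrative, not necessarily the notebook's exact names.

```python
import os
import torch
from pdf2image import convert_from_path          # pip install pdf2image (requires poppler)
from qdrant_client import QdrantClient, models   # pip install qdrant-client

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "class_10_science"

# Each point stores one page as a multi-vector: n_patches vectors of size 128.
# Qdrant compares them against the query tokens with the MaxSim comparator described above.
if not client.collection_exists(COLLECTION):
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM,
            ),
        ),
    )

# Convert every page of the PDF into an RGB image, with a little housekeeping metadata.
os.makedirs("pdf_data", exist_ok=True)
pages = convert_from_path("data/class10_science_ch13.pdf", dpi=150)
all_images = []
for i, page in enumerate(pages):
    img = page.convert("RGB")
    img.save(f"pdf_data/page_{i}.png")           # keep a copy on disk, as the notebook does
    all_images.append({"doc_id": 0, "page_num": i, "image": img})

# Embed the pages in small batches (keep this at 2-3 on a laptop) and upsert them.
BATCH_SIZE = 2
points = []
for start in range(0, len(all_images), BATCH_SIZE):
    batch = all_images[start : start + BATCH_SIZE]
    inputs = processor.process_images([item["image"] for item in batch]).to(model.device)
    with torch.no_grad():
        embeddings = model(**inputs)             # shape: (batch, n_patches, 128)
    for item, emb in zip(batch, embeddings):
        points.append(
            models.PointStruct(
                id=item["page_num"],
                vector=emb.cpu().float().tolist(),   # list of per-patch vectors
                payload={"doc_id": item["doc_id"], "page_num": item["page_num"]},
            )
        )

client.upsert(collection_name=COLLECTION, points=points)
```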
Now here we generate the embeddings. Each batch of images first goes through the ColPali preprocessor, which standardizes the image size, and then through the ColPali model, which actually produces the embeddings. Once I have all those embeddings, I want to store them in the vector database, and that's what I do here: I upsert all the points into the collection I created, where each point is essentially that page's set of vectors. In this case I have just ten pages, so it generates the embeddings for those ten pages and stores them there.
Now for the final piece: retrieval. I've asked the question "what are the different trophic levels?", because that topic is in the book. The question also needs to go through the embedding model, just like the images, so it passes through the preprocessor and the model, and once that's done we do a semantic search against the vector database with the query tokens. I've set the limit to five, which means I want the top five pages relevant to this question, and at the end you get five pages back. If you want to see what those five pages look like, there's a small wrapper Python function that takes all the images and displays them. And in fact, if you look at this one, this is the page where the trophic levels are described; it was identified purely from the question and the ColPali embeddings, along with the other pages we got back.
So retrieval is done. ColPali only covers retrieval; its job ends here. If you compare this with the traditional technique, we've arrived at the same point: we have the retrieved images and we have the question, and now we can use any multimodal LLM to generate the answer, having skipped everything else. You can use any generative model of your choice. If you don't have an AWS account, or you have access to some other model, use that; and if you don't have Bedrock access you can use a local model, the response may not be that great, but you can work with it. This next cell is again a wrapper function that converts the images into the format the model expects, because we need a multimodal LLM; we'll take one from Ollama, and depending on which model you use, it will want the input in a certain format. Here it wants the image data in base64, and that's all this tiny function does. Then we just call generate with the model I'm using, the query, and the image. Notice that I'm sending the full query text, not the embedding of the query; Ollama has nothing to do with that embedding, which was needed only for the semantic search.
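Here is a matching sketch of the retrieval query and the local generation call. It reuses the processor, model, client, COLLECTION, and all_images from the sketches above; the Ollama model name (llava) and the response handling are assumptions for illustration, not necessarily what the notebook uses.

```python
import base64
import io
import torch
import ollama                              # pip install ollama; a local Ollama server must be running

query = "What are the different trophic levels?"

# Embed the query with the same preprocessor + model, giving one vector per query token.
q_inputs = processor.process_queries([query]).to(model.device)
with torch.no_grad():
    q_emb = model(**q_inputs)[0]           # shape: (n_query_tokens, 128)

# MaxSim search in Qdrant: limit=5 means "give me the top five pages".
result = client.query_points(
    collection_name=COLLECTION,
    query=q_emb.cpu().float().tolist(),
    limit=5,
)
top_pages = [hit.payload["page_num"] for hit in result.points]

# Hand the best page image plus the raw question to a local multimodal model via Ollama.
best_image = all_images[top_pages[0]]["image"]
buf = io.BytesIO()
best_image.save(buf, format="PNG")
img_b64 = base64.b64encode(buf.getvalue()).decode()

response = ollama.generate(model="llava", prompt=query, images=[img_b64])
print(response["response"])
```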
Then we get some response back. Now, if you want to use Bedrock instead, you need Bedrock access. How many of you know about Bedrock? Perfect. It's a managed service on AWS through which you can access different kinds of models, and the way Bedrock expects you to pass multimodal input is a little different. That's why there are some wrapper functions here which format your prompt according to the multimodal model's requirements. You can go through those two functions; it's the standard Converse API, nothing fancy, and I don't want to dwell on it because that's not the purpose of this session. Ultimately you pass the images and the query, you mention the model ID, in this case Claude Sonnet 3.7 (you could very well use Sonnet 4 if you like), and you generate the final response. With me so far?
Now comes the agent part. How do we make this agentic? It's very simple. You don't have to go through all of those cells again, because ultimately what we want is this: when somebody asks a question, we want an agent to retrieve the shortlisted images and hand them over. That's all, and we've already seen how to shortlist those images. So what we do is create a function, retrieve_from_qdrant, which takes your query, and if you look at its return value it's the matched image paths, which is exactly what we want, nothing else. The code inside this function is the same code that we went through across multiple cells previously; it does the same thing. To make it agentic I've used a framework called Strands. Have you heard of Strands? Strands is a new agentic framework; let me show you, strandsagents.com. It's an SDK launched by AWS; I worked on the launch. There are YouTube videos as well; just search for Strands Agents and you'll find the launch blog. Basically it's very simple. Just to give you an idea of how to get started, you pip install it, and, actually, do you want to see a quick demo of Strands before we go on? Will that help? Yes? Okay, let me quickly spend four or five minutes on that. I have a good demo, actually. How many of you have heard of 3Blue1Brown? Perfect, then let me show you that; it might be interesting.
Strands is a very simple, model-first framework. We're just taking a pause on our problem here; whatever we learn in this detour, we'll use this framework to make our workflow agentic and add the voice part as well. Model-first means the models are now strong enough that we expect the model itself to reason, rather than us feeding the agent a long backstory, goals, elaborate prompting, and all of that. We don't want any of that. We throw a question at it and expect the model to generate the response and do the reasoning on its side.
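In its most minimal form, that model-first idea looks roughly like this. A sketch only, assuming the strands-agents package; the question text is just a placeholder.

```python
# pip install strands-agents
from strands import Agent

# No model specified: Strands falls back to its default Bedrock model (Claude 3.7 Sonnet).
agent = Agent()
agent("What is agentic AI, in one sentence?")
```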
That model-first philosophy is also why the framework is very lightweight. It has integrations with different model providers: you can use a model from Bedrock, directly from Anthropic, or through LiteLLM. Have you heard of LiteLLM? When you have access to LiteLLM you can use any model that LiteLLM supports. As for the name: strands, by definition, refers to the DNA structure, which has two strands, and those two strands stand for model and tools. That's all. You make an agent with one model and a few tools, and you simply ask your question; it's as simple as that.
Let me show you one quick demo; let's see if it works. Is it visible from the last row? Not really, so I'll read it out. We're importing Agent and we're importing the tools. Wait, this is the MCP one; no, that's not the one I want to show you first. Okay, this one, I think it's a video, it should work fine. First we install strands-agents and strands-agents-tools, a simple pip install, and it's open source. It supports Ollama as well, so you don't have to have an AWS account or anything of that sort. What we're going to do is read a file, create a summary, write the summary into a file, and also add a voice part. We import Agent and the BedrockModel. By default it uses a Bedrock model, Claude 3.7 actually, but you can use any other model. I've used some built-in tools called read_file, write_file, and speak. This is the model ID, and this is the prompt; you can have a prompt or skip it, it doesn't matter. Lastly you create the agent: it contains the model ID, the system prompt, and the tools, and note that we haven't written the code for any of these tools, they come built in. Then I just ask a particular question: in the prompt I say there's a textbook in my local directory, read it, create a summary, write it into the local directory, and also speak out the final answer. And you can see it working through the tools: first it reads the file, then it creates the summary...
>> ...functions like a camera: light entering through the cornea and focused by a lens onto the...
We haven't done anything ourselves; it just keeps going.
>> ...controls pupil size to regulate incoming light... the eye can adjust focal length through accommodation...
All right, now I'll share one more thing. I won't show you the code for this, because that's not the purpose, but of course you've heard of MCP. I've created an MCP server with Manim. Have you heard of Manim? Just watch. The idea is: there's a Manim MCP server, and this client, which is nothing but our Strands agent, calls that MCP server. I can give it any question; here the question is: create a Manim scene that draws a cubic function like 2x³ minus... and so on. And see what happens: it's executing the code, calling the MCP server, it's working, and in a moment it should give us a response.
So it generated this video. And now you see the familiarity; looks similar, right? I haven't done anything; all I've used is the Manim SDK to create an MCP server that can generate videos like the ones 3Blue1Brown makes. That's just a small demo of how you can use Strands with an MCP server, write simple code, and do wonderful things. So that's Strands. The core idea is: pip install, use it with the built-in tools and the model of your choice; there's no scaffolding beyond that. You pip install, create an instance, and ask a question. If you don't mention any model it will use a Bedrock model by default, but as we saw in the demo you can define your Bedrock model explicitly.
Now let's come back to our problem. In this case our tool is not a default one but a tool we define ourselves: the retrieval tool. How do you create a custom tool? Just import tool and use it as a decorator on top of your function. That's all; it now becomes a tool for me, just like read_file, write_file, and speak. We can use a Bedrock model, or whatever you prefer. And now look at this: we're also importing an image_reader. Why? I'll tell you in a moment. Remember that when we used a Bedrock model for the final answer earlier, we wrote custom functions that knew how to build the multimodal prompt with images for Bedrock models. I don't have to do any of that now; I can simply use this image_reader tool, which takes an image and prepares the prompt for us. Then there's a system prompt that says you are a RAG-based system and so on, and that these are the two tools you have to use. That's all. Then you create an agent just like before: you define the model and the system prompt, and in this case we use two tools, retrieve_from_qdrant, our custom tool, and image_reader for the generation part. Then we ask the question, what are the different trophic levels, and the agent generates the response just like before, except now everything is done by the agent.
And the beauty is, say you now want to add the voice feature: you don't just want the answer as text, you want the final response spoken aloud. So far the flow is this (I'm shrinking the diagram so everything fits): we ask a question, it goes to the Strands agent, the agent uses the custom retrieval tool we created, gets the relevant chunks, which are nothing but the shortlisted pages, and then uses any of these models, Bedrock, Ollama, whatever, to generate the final response, using the image_reader tool along the way. To add the voice functionality, all I do is use the speak tool; that's just one more import. And that's what we're doing here: we add speak to the tool list, the system prompt remains the same, and I'm querying the same thing. Now let's ask the question.
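Before running it, here is a rough sketch of that agent wiring, assuming the strands-agents and strands-agents-tools packages and reusing the processor, model, client, and COLLECTION from the earlier sketches; the tool name, file paths, and model ID are illustrative, and the notebook's exact code may differ.

```python
import torch
from strands import Agent, tool
from strands.models import BedrockModel
from strands_tools import image_reader, speak      # built-in tools, no code to write for these

@tool
def retrieve_from_qdrant(query: str) -> list[str]:
    """Embed the query with ColPali, MaxSim-search Qdrant, and return
    the file paths of the top matching page images."""
    q_inputs = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        q_emb = model(**q_inputs)[0]
    result = client.query_points(COLLECTION, query=q_emb.cpu().float().tolist(), limit=5)
    return [f"pdf_data/page_{hit.payload['page_num']}.png" for hit in result.points]

SYSTEM_PROMPT = (
    "You are a RAG assistant. Use retrieve_from_qdrant to find the relevant pages, "
    "read them with image_reader, answer only from what you see, and speak the final answer."
)

agent = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0"),  # example ID
    system_prompt=SYSTEM_PROMPT,
    tools=[retrieve_from_qdrant, image_reader, speak],
)

agent("What are the different trophic levels? Explain the answer in a natural female voice.")
```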
Let me run this, and in the question itself I'll ask it to explain the answer in a female voice, in a natural way. Let's see. I hope I'm connected to the internet. When you run this code in your own environment you can simply remove the system prompt and you'll still get the right answer. In fact, try changing the prompt: ask for a male voice, or a robotic rather than a natural delivery; I haven't tried that. The idea is to see whether Strands is able to forward that information to the model, so you don't strictly need a system prompt. It may be my internet; it doesn't normally take this long, so just give it a shot, it should work fine. Okay, it's running now, a little slow. It has shortlisted the images, and now it should speak in a female voice. While that happens...
>> Trophic levels are the different feeding positions in a food chain, representing the flow of energy through an ecosystem. There are typically four...
Okay, let me stop this and try something else. Let me delete the system prompt and keep just the model and the tools: no mention of speak in the prompt, no system prompt, nothing. And here I'll change the request to a male voice and give it a shot. I don't want to interrupt it while it runs.
>> ...which are small carnivores that eat herbivores. These might include frogs, small birds, or foxes. The fourth trophic level is occupied by tertiary consumers or top carnivores.
You could, in fact, ask it to summarize in 50 or 100 words rather than waiting for the whole thing to finish. It's still going; give it a second. And before I forget: if you want to know more about the traditional multimodal technique, there's a part three in this GitHub repo where you'll find the details of that architecture; that notebook shows how to do the same thing by preprocessing the images, text, and tables. So do play around with the repo. Okay, it's done. So I've just created the agent again, this time without any system prompt, executed it, and now let's run the query. The agent only knows about the model and the tools, nothing else; there's no system prompt. I just hope we at least get a male voice.
>> Let me explain trophic levels, which are essentially the different feeding positions in a food chain or ecosystem. Think of them as the levels in nature's dining hierarchy. Starting at the base, we have the producers. These are mainly...
Okay, let's see if I get a male voice this time. Let's try this.
>> Trophic levels are essentially the different feeding positions in a food chain, showing how energy flows through an ecosystem. Let me walk you through the main... At the very bottom we have the producers.
You can try this yourself and see what you need to mention to steer the tool. In fact, putting it in the prompt is not really the right way to do this, because by default it's a female voice; you can actually change the behavior of the speak tool itself.
The way to do that is to go to the documentation. Under tools you'll find an overview, and there's a tool spec for each of the tools where you can specify the persona you want. That's the more deterministic way to do it; otherwise you can put it in the system prompt. So that's all I have. If you have any questions, feel free to ask, or feel free to connect; I'd be more than happy to continue the conversation offline.
>> Have you seen anyone already using this in production, and what type of scaling...?
>> Yeah, that's a good question. We have used this with a leading insurance company that had images of driver's licenses and images of insurance policies and so on. We tried different techniques; one of them was OCR, which worked fine, but ColPali was working pretty well too. The only drawback I've seen with the ColPali model is that it's quite heavy, but that heaviness shows up only at data-ingestion time, when you create the embeddings. Once that's done, it's pretty fast at query time; it's only the ingestion that's a bit heavy. And if you're thinking that with 1,000 documents of 1,000 pages each, a query would have to search across all of those images, that's not how it works. Imagine the same question with a text-based embedding model: if you have a book of a million pages and a hundred million vectors, does the vector database scan every vector when you ask a question? No. Databases use indexing techniques for exactly this, and the same indexing techniques are used here; it's just that the vectors now represent something different, patches rather than chunks of text. The semantic search still happens very efficiently. One of the techniques used is HNSW, Hierarchical Navigable Small World, which uses a layered, graph-like structure: it starts at the top layer, finds the closest node, then moves down and searches among that node's neighbours, and so on. You can think of it like tree pruning in computer science; it keeps reducing the search space.
>> A quick follow-up: could we see more companies adopting this as a replacement for the traditional approach?
>> That's a good question. No, I don't think this is a replacement; it's just another technique, and this is a space where things change very fast. Personally I feel that if we get a vision-based model that's more efficient in terms of computation, this could become a very good option. But again, it may work for your data or it may not; it's all about your data. What I generally do whenever I get a problem is try to solve it in the most efficient, most cost-effective way. First find out which architecture works fine for your data. If the traditional one works, I don't see why you'd complicate things and create images and all of that.
I would go to this approach only when my data set is very convoluted, where as a human you feel you can understand the data only by looking at it. Imagine you have a PDF that is mostly plain text: you could still answer questions about it even if somebody converted it to a text file and handed you that. But if the PDF contains mostly images with text embedded on top of them, you would say: don't give me only the text, give me the book, because I need to see the visual context. So this approach replicates how a human understands that kind of data. I would recommend not starting with this; start with the traditional technique, because it is more cost-effective and less heavy, since here we are storing a lot of vectors for each page. Use this when you have very convoluted data. Yes, sir?

>> I'm trying to get a sense of when it works and when it doesn't. The image is cut into these little squares. If you're in the middle of a paragraph and it gets split into two different segments, does that cause problems in practice?

Yeah, that's a good question. The model doesn't know that any "chunking" is happening; that is just how we describe it. To the model it is simply an image, and the way it creates embeddings for that image is through those patches. Why does the model handle this well? Because it is a vision-based model: during training, all the training data was split into patches in exactly the same way, and that is what it was optimized for. For example, during the training of ColPali (not at inference time), when it was given an image of a cat and the text about the cat, the cat image was chopped into that many patches. Likewise, an image of a PDF page was chopped into the same patches. So the behavior was baked in during training. We don't have to ask the model how it does this; it has learned it from a huge amount of data.

Looking at it from the outside, I had the same doubt: how can the model create a useful embedding when it splits a table into multiple patches? What is the relationship between one patch and another? Later we realized this is handled during training itself. Initially the model could not do it, the loss would have been very high, and that is exactly what training optimizes away. Once it is optimized, you don't have to worry about it. And this patching-and-embedding idea is not a new technique; it has existed in a lot of vision-based models. Now we are using it for retrieval. In fact, if you are curious, I would recommend (and I'll try this later myself) fine-tuning this model, or training it from scratch on a smaller data set if you have the resources, with a different patch size: say, start with a patch size of four, and see how it behaves.
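As an illustration of what "chopping a page into patches" means, here is a minimal ViT-style patchification sketch in NumPy. The 448x448 page size and 16-pixel patch size are placeholders for illustration, not necessarily the values ColPali or its vision backbone actually uses.

```python
# Sketch: split a rendered page image into fixed-size square patches, ViT-style.
# Page size (448x448) and patch size (16 px) are illustrative assumptions.
import numpy as np

page = np.random.rand(448, 448, 3)   # stand-in for a rendered PDF page (H, W, C)
patch = 16

h, w, c = page.shape
assert h % patch == 0 and w % patch == 0

# (H, W, C) -> (num_patches, patch*patch*C): each row is one flattened patch,
# the unit that gets its own embedding vector downstream.
patches = (
    page.reshape(h // patch, patch, w // patch, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, patch * patch * c)
)
print(patches.shape)   # (784, 768): 28x28 patches of 16x16x3 pixels each
```

A paragraph or table that straddles two patches is not special-cased anywhere; the model simply learned, during training, to produce embeddings that remain useful under exactly this kind of split.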
I have a lot of assumptions about that, but doing that exercise will give you a lot of clarity about how the semantic search works and why that max-similarity matrix multiplication we did earlier, the MaxSim step, is a good technique. Imagine the document you uploaded is the "Attention Is All You Need" paper and you ask, "What is positional embedding?" That phrase appears on almost every page. The system should not return all of those pages; it should return the page where the actual explanation of positional embeddings lives. When you think it through, you will find that the MaxSim computation takes care of that: it surfaces the page where all the tokens of your query together have maximum similarity with that page, not just one token of your question matching one patch of some page. Otherwise, asking for the top five would give you five more or less random pages where "positional embedding" happens to be written. So just give it a shot. (There is a small MaxSim sketch at the end of this exchange.) Yes, sir?

>> Is there any sort of hybrid approach where you can process images and text together?

That is something one of my teammates has started working on, where we are trying to use ColPali along with a traditional technique. The way we are doing it is based on the question we receive: while we do the pre-processing and store the embeddings, we store them differently. Not all the data goes through ColPali; for some data we use ColPali, and for the rest we use the traditional technique. But for any one data set we use a single model. We cannot say that the first five pages of a document will use ColPali and the next five will use the traditional technique; that's not how we are exploring it. We are trying to use two different approaches within the same architecture. The reason is that the customer's data set started with certain requirements and then changed; it was appended, and the newly added data is completely different, but they want one unified system. So we check where a question is coming from and store some metadata to decide whether it should be routed to one store or the other. Beyond that I haven't seen much; it's usually either one or the other. Yes, sir?

>> Did you have to fine-tune the ColPali model for it to work well?

No, I have not done that. These are already fine-tuned models; you can just make use of them. I forget which data set they used for fine-tuning; you can read the research paper, the link is there. But you don't have to fine-tune it yourself. Can you? Yes, of course, and that's what I was suggesting to him earlier. I haven't done it myself, but I will certainly try it; it's a good exercise.

>> So it worked well for your use case?

Yeah, it just worked fine.
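Going back to the scoring question above, here is a minimal NumPy sketch of ColPali-style late-interaction (MaxSim) scoring. The embedding dimension, token and patch counts, and the random vectors are placeholders; in the real pipeline both the query-token and page-patch embeddings come from the ColPali model.

```python
# Sketch: late interaction (MaxSim) scoring, ColPali-style.
# score(query, page) = sum over query tokens of (max over page patches of similarity)
# Dimensions and data are illustrative; real embeddings come from the model.
import numpy as np

dim = 128
num_query_tokens = 8    # tokens of "what is positional embedding"
num_patches = 1024      # patches of one rendered page

rng = np.random.default_rng(0)
Q = rng.normal(size=(num_query_tokens, dim))   # query token embeddings
P = rng.normal(size=(num_patches, dim))        # page patch embeddings

# Normalise so dot products behave like cosine similarities.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
P /= np.linalg.norm(P, axis=1, keepdims=True)

sim = Q @ P.T                  # (num_query_tokens, num_patches) similarity matrix
score = sim.max(axis=1).sum()  # best patch per query token, summed over the query
print(score)
```

Ranking pages just means computing this score for every candidate page and sorting. Because every query token must find a strongly matching patch for the summed score to be high, a page that merely mentions "positional embedding" in passing loses to the page that actually explains it.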
That's because I used standard textbooks that are publicly available. For really convoluted data, try this with an IKEA data set. The IKEA manuals are a good test because you cannot use OCR-based techniques on them: it's a strange, sparse data set, and it will give you good intuition, because those are questions a person can answer only by looking at the manual, and a traditional text-based technique cannot. So that's a good data set to try this on. All right, thank you so much, everyone, for coming. I really appreciate it. And one last thing: if you need any AWS credits for any of your projects, just ping me on LinkedIn and I'll share a few. Even if you need more, I can give you more.