Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily

Channel: aiDotEngineer

Published at: 2025-07-29

YouTube video id: wRJD0inpmjU

Source: https://www.youtube.com/watch?v=wRJD0inpmjU

Hi everyone. Thank you so much for coming. My name is Julia. I'm CEO and co-founder of Quotient AI.

I'm Deanna Emery. I am a founding AI researcher at Quotient AI.

My name is Mara Sher. I'm head of engineering at Tavily. And today we are going to talk to you about evaluating AI search.

So let me start with a fundamental challenge we're all facing in AI today. Traditional monitoring approaches simply aren't keeping up with the complexity of modern AI systems. First off, these systems are dynamic. Unlike traditional software, AI agents operate in constantly changing environments. They're not just executing predetermined logic. They're making real-time decisions based on evolving web content, user interactions, and complex tool chains. These systems can also have multiple failure modes that happen at the same time. They hallucinate, retrieval fails, they make reasoning errors, and all of these are interconnected.
A little bit about what we do at Quotient: we monitor live AI agents. We have expert evaluators that can detect objective system failures without waiting on ground-truth data, human feedback, or benchmarks.
A year ago, we met Rotem, Tavily's founder and CEO, and he posed us a problem that really crystallized the core issues we needed to solve.
Here's the challenge: how do you build production-ready AI search agents when your system will be dealing with two fundamental sources of unpredictability you cannot proactively control? Under the hood, Tavily's agents gather their context by searching the web. The web is not static. Traditional benchmarks assume stable ground truth, but when you're dealing with real-time information, ground truth itself is a moving target. Your users also don't stick to your test cases. They can ask odd, malformed questions. They have implicit context they don't really share and you're not aware of. And this is not just a theoretical problem. Tavily processes hundreds of millions of search requests for its agents in production, and they needed a solution that worked at scale in these real-world conditions. This is the story of how we built that.
Yes. So at Tavily we're building the infrastructure layer for agent interaction at scale, essentially providing language models with real-time data from across the web.

There are many use cases where real-time AI search delivers value, and these are just a few examples of how our clients are using Tavily to power their applications: from a CLM company that built an AI legal assistant to power their legal and business teams with instant case insights, to a sports news outlet that created a hybrid RAG chat agent that delivers scores, games, and news updates, to a credit card company that uses real-time search to fight fraud by pinpointing merchant locations.
So as you can imagine, evaluating a system in this kind of vast, fast-moving setting is quite challenging.

We have two principles that guide our evaluation. First, the web, which is the foundation of our data, is constantly changing. This means that our evaluation methods must keep up with that ongoing change.

Second, truth is often subjective and contextual. Evaluating correctness can be tricky because what's right may depend on the source, or the timing, or the user's needs. So we have a responsibility to design our evaluation methods to be as unbiased and fair as possible, even when absolute truth is hard to pin down.
So the first thing to think about in offline evaluation is which data to use to evaluate your system.

Static data sets are a great start, and there are many widely used open-source data sets available on the web. SimpleQA is one example. It's a benchmark data set from OpenAI that serves as a standard for evaluating retrieval accuracy. Many leading AI search providers use SimpleQA to evaluate their performance. SimpleQA is designed to evaluate a system's ability to answer short fact-seeking questions with a single empirical answer.

Another widely adopted data set is HotpotQA, which evaluates a system's ability to answer multi-hop questions, where reasoning across multiple documents is required to retrieve the final answer. Data sets like SimpleQA and HotpotQA are a great start for evaluating your system. But what happens when you're evaluating real-time systems, especially when measuring that your system keeps up with rapidly evolving information and avoids regressions, which is the setting where we operate? Those kinds of static data sets also don't address the challenge of benchmarking questions where there is no single true answer or where subjectivity is involved.
This is what led us to think beyond static data sets, towards dynamic evaluation that reflects the changing pace of the web. Essentially, dynamic data sets are essential for benchmarking RAG in real-world production systems. You can't answer today's questions with yesterday's data. Dynamic data sets have real-world alignment. They have broad coverage, as you can easily create eval sets for any domain or use case that is relevant to your specific needs. And they also ensure continuous relevance, because they are regularly refreshed, which means that your system is always evaluated against the latest data.
This led us to build an open-source agent that builds dynamic eval sets for web-based RAG systems. It's open source, and we encourage everyone to check it out and contribute. I also want to acknowledge the work of Ayal, our head of data at Tavily, who initiated this project a couple of months ago.
As you can see here, this is an example of a data set generated by the agent. It generates question-and-answer pairs for targeted domains using information found on the web.
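
To make the shape of such a data set concrete, here is a hypothetical sketch of a single generated record. The field names are illustrative assumptions, not the agent's actual schema.

```python
# Hypothetical example of one generated eval-set record.
# Field names are illustrative, not the actual schema of the open-source agent.
example_record = {
    "domain": "sports news",
    "question": "Which team won the most recent UEFA Champions League final?",
    "answer": "...",  # grounded answer derived from the evidence below
    "answer_context": [
        # evidence snippets and sources the answer was generated from
        {"source_url": "https://example.com/match-report", "snippet": "..."},
    ],
    "generated_at": "2025-06-01",  # timestamp so stale pairs can be refreshed
}
```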
So the agent leverages the LangGraph framework, and it consists of these key steps. First, it generates broad web search queries for targeted domains, which essentially lets you create eval sets for any domain of your choice and the specific needs of your application.

The second step is to aggregate grounding documents from multiple real-time AI search providers. We understand that we cannot just use Tavily to search the web on specific domains, find grounding documents, then generate question-and-answer pairs from those documents, and then evaluate our performance on those documents. That's why we use multiple real-time AI search providers, to both maximize coverage and minimize bias.

The third step, which is the key step in this process, is to generate the evidence-based question-and-answer pairs. We ensure that in the generation process the agent is obliged to also generate the answer context, which increases the reliability of our question-and-answer pairs and reduces hallucinations. You can always go back and check which sources were used, and which evidence from those sources was used, to generate each question-and-answer pair. And lastly, we use LangSmith to track our experiments, which is a great observability tool to manage these offline evaluation runs and see how your performance changes at different time steps.
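
As a rough illustration of how such a pipeline can be wired together with LangGraph, here is a minimal sketch. The node names, state fields, and stubbed logic are assumptions for illustration, not the actual open-source implementation.

```python
# Minimal LangGraph sketch of a dynamic eval-set generator.
# Node internals are placeholders; the real agent's steps and state differ.
from typing import TypedDict, List
from langgraph.graph import StateGraph, END


class EvalSetState(TypedDict):
    domain: str
    queries: List[str]
    documents: List[dict]
    qa_pairs: List[dict]


def generate_queries(state: EvalSetState) -> dict:
    # Step 1: broad web-search queries for the target domain (stubbed).
    return {"queries": [f"latest developments in {state['domain']}"]}


def gather_documents(state: EvalSetState) -> dict:
    # Step 2: aggregate grounding documents from multiple search providers
    # to maximize coverage and minimize single-provider bias (stubbed).
    return {"documents": [{"provider": "provider_a", "url": "...", "content": "..."}]}


def generate_qa_pairs(state: EvalSetState) -> dict:
    # Step 3: generate evidence-based QA pairs, keeping the answer context
    # attached so every pair is traceable back to its sources.
    return {"qa_pairs": [{"question": "...", "answer": "...", "answer_context": state["documents"]}]}


graph = StateGraph(EvalSetState)
graph.add_node("generate_queries", generate_queries)
graph.add_node("gather_documents", gather_documents)
graph.add_node("generate_qa_pairs", generate_qa_pairs)
graph.set_entry_point("generate_queries")
graph.add_edge("generate_queries", "gather_documents")
graph.add_edge("gather_documents", "generate_qa_pairs")
graph.add_edge("generate_qa_pairs", END)

app = graph.compile()
eval_set = app.invoke({"domain": "sports news", "queries": [], "documents": [], "qa_pairs": []})
```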
The next steps we want to address are to support a range of question types, both simple fact-based questions and multi-hop questions similar to HotpotQA. We also want to ensure fairness and coverage by proactively addressing bias and covering a wide range of perspectives for each subject we generate questions and answers for. Additionally, we want to add a supervisor node for coordination, which has proven valuable, especially in these multi-agent architectures, and this will increase the quality of our question-and-answer pairs.
The next step to think about is benchmarking, and we argue that it's important to measure accuracy, but you should not stop there. You should ensure a holistic evaluation framework which uses benchmarks that, in our case, measure your source diversity, your source relevancy, and your hallucination rates. It's also important to leverage unsupervised evaluation methods that remove the need for labeled data, which enables you to scale your evaluations and address the subjectivity issue.

With that, I'll pass it over to Deanna, who will explain more about these reference-free benchmarks and also share results from an experiment we ran using a static and a dynamic data set that was generated by the agent I described before.
So, we performed a two-part evaluation of six different AI search providers. The first component of this experiment was to compare the accuracy of search providers on a static and a dynamic benchmark, in order to demonstrate that static benchmarking is not a comprehensive method for evaluating AI search. The second component was to evaluate the dynamic data set responses using reference-free metrics, and we compared these results to the reference-based accuracies we get from the benchmark, in order to demonstrate that reference-free evaluation can be an effective substitute when ground truths are not available.
So jumping right in: for our static versus dynamic benchmarking comparison, we used the SimpleQA benchmark as the static data set, and we used a dynamic benchmark of about a thousand rows created by Tavily. As you can see here, both data sets have roughly similar distributions of topics, and this helps to ensure a fair comparison and diversity of questions.

To evaluate the AI search providers' performance on these two benchmarks, we're using the SimpleQA correctness metric. This is an LLM judge which is used on the SimpleQA benchmark. It compares the model's response against a ground-truth answer in order to determine if it's correct, incorrect, or not attempted.
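
As a rough sketch of how that style of grading can be implemented (this is not OpenAI's official SimpleQA grading prompt), an LLM judge is simply prompted to compare the response against the ground truth and return one of the three labels:

```python
# Simplified LLM-judge grader in the spirit of the SimpleQA correctness metric.
# The prompt and model choice are illustrative, not OpenAI's official grader.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading an AI search provider's answer.
Question: {question}
Ground-truth answer: {ground_truth}
Model response: {response}

Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""


def grade(question: str, ground_truth: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question, ground_truth=ground_truth, response=response
            ),
        }],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```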
And so here we're showing the correctness scores from that SimpleQA benchmark compared against the dynamic benchmark. We've anonymized the search providers for this talk, but I do want to call out that the SimpleQA accuracy scores here are all self-reported, and so they don't all necessarily have clear documentation on how they were calculated. But as you can see, the correctness scores for the dynamic benchmark, in blue, are substantially lower. And not only that, the relative rankings have also changed pretty considerably. For example, provider F, all the way at the end of this plot, performs the worst on SimpleQA, but it performs the best on the dynamic benchmark.
Looking a little closer into the results: while this SimpleQA evaluator is useful, it's certainly far from perfect. I have a few examples here of model responses that were flagged as incorrect by this LLM judge, but if you look at the actual text of the model outputs, they do contain the correct answer from the ground truth.

On the flip side of things, here is an example where the LLM judge classified a response as correct. And yes, you can see that the correct answer is in this response. But while the correct answer might be present, that doesn't necessarily mean that the full answer is right. This evaluation is not accounting for any of the additional text in the response, and there might be hallucinations in there that would invalidate it. So ultimately this evaluation falls short of identifying when things go wrong in AI search.
So what are some other ways that we can identify when things go wrong? Up to this point we have been talking about a reference-based approach to evaluation. But what if we don't have ground truths? In most online and production settings this is typically the case, and as we've already discussed, it's especially so in AI search. So the question is: can reference-free metrics effectively identify issues in AI search?

For this talk, we're going to look at three of Quotient's reference-free metrics. We'll look at answer completeness, which identifies whether all components of the question were answered. It classifies model responses as either fully addressed, unaddressed, or unknown, if the model says "I don't know." Then we'll look at document relevance, which is the percent of the retrieved documents that are actually relevant to addressing the question. And finally, we'll look at hallucination detection, which identifies whether there are any facts in the model response that are not present in any of the retrieved documents. We used these metrics to evaluate the search providers' responses on the dynamic benchmark.
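
Quotient's production detectors aren't shown in the talk, but as a rough approximation, each of these checks can be framed as an LLM judgment over the response and its retrieved documents, with no ground truth needed. A minimal sketch, assuming a hypothetical `judge()` helper that asks an LLM a yes/no question:

```python
# Reference-free evaluation sketch. The judge() helper and prompts are
# illustrative assumptions; Quotient's actual detectors are more involved.
from typing import List
from openai import OpenAI

client = OpenAI()


def judge(prompt: str) -> bool:
    """Hypothetical helper: ask an LLM a yes/no question and parse the reply."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt + "\nAnswer YES or NO."}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")


def answer_completeness(question: str, response: str) -> str:
    # Classify the response as fully_addressed / unaddressed / unknown.
    if judge(f"Does this response say it doesn't know?\n{response}"):
        return "unknown"
    if judge("Does this response address every part of the question?\n"
             f"Question: {question}\nResponse: {response}"):
        return "fully_addressed"
    return "unaddressed"


def document_relevance(question: str, documents: List[str]) -> float:
    # Fraction of retrieved documents that are relevant to the question.
    if not documents:
        return 0.0
    relevant = sum(
        judge(f"Is this document relevant to the question?\nQuestion: {question}\nDocument: {d}")
        for d in documents
    )
    return relevant / len(documents)


def has_hallucination(response: str, documents: List[str]) -> bool:
    # Flag facts in the response that appear in none of the retrieved documents.
    context = "\n\n".join(documents)
    return judge(
        "Does the response contain any factual claim that is not supported "
        f"by the documents?\nDocuments: {context}\nResponse: {response}"
    )
```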
So we've got answer completeness plotted here. The stacked bar plot shows the number of responses that were either completely answered, unaddressed, or marked as unknown. And if we look back at the overall rankings that we saw earlier on the dynamic benchmark, you can see that the rankings from answer completeness pretty closely match. The average performance scores for the two get a correlation of 0.94.
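
For reference, that figure is just the correlation between the per-provider scores from the two methods, computed along these lines (the values below are placeholders, not the reported results):

```python
# Correlation between per-provider benchmark accuracy and answer-completeness
# rates. The numbers here are placeholders, not the actual experiment's scores.
import numpy as np

benchmark_accuracy = np.array([0.42, 0.55, 0.61, 0.48, 0.58, 0.66])  # one value per provider
completeness_rate = np.array([0.50, 0.63, 0.70, 0.52, 0.65, 0.72])

r = np.corrcoef(benchmark_accuracy, completeness_rate)[0, 1]
print(f"correlation: {r:.2f}")
```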
So this indicates that the reference-free metric can capture relative performance pretty well. But completeness is still not the same thing as correctness. And when we have no ground truths available, then we have to turn to the next best thing, and that is the grounding documents.

So this is where document relevance and hallucination detection come in. Both of these metrics look at those grounding documents in order to measure the quality of the model's response.
Unfortunately, of all of the search providers we looked at, only three of them actually return the retrieved documents used to generate their answers. The majority of search providers typically only provide citations, and these are largely unhelpful at scale and also really limit transparency when it comes to debugging.

So, these are the document relevance scores for the three search providers, and they've been re-anonymized here. The plot on the left shows the average document relevance, the percent of retrieved documents that are relevant to the question. And the plot on the right shows the number of responses that have no relevant documents.
And if we consider these results in conjunction with answer completeness, we find that there's a strong inverse correlation between document relevance and the number of unknown answers. And this kind of matches intuition: if you have no grounding, no relevant documents for the question, the model should say "I don't know" rather than trying to answer it.
And so this brings us to hallucination detection. And here we were actually surprised to see that there was a direct relationship between the hallucination rate and document relevance. Provider X here has the highest hallucination rate, but it also had the highest overall document relevance. And this is kind of counterintuitive. But if we think about it more, provider X had high answer completeness, the lowest rate of unknown answers, and it also had the highest answer correctness of these three providers in the benchmarking earlier.

So this probably implies that provider X's responses are more likely to provide new reasoning or interpretations, or maybe they're more detailed and thorough, and this just creates more opportunity for hallucination in their responses. But the point I want to make here is that when considering these metrics, depending on your use case, you might index more heavily on one over another. They're measuring different dimensions of response quality, and it's often a give and take. If you perform really well on one, it might be at the expense of another. And as we see here, there is a trade-off between answer completeness and hallucination.
But also, if you take these three metrics in conjunction, you can use them to understand why things went wrong and identify potential strategies for addressing those issues. This diagram here shows a few examples of how you can interpret your evaluation results to identify what to do to fix them. We've got one example here where maybe your response is incomplete, but you have relevant documents and you have no hallucinations. This probably means you don't have all the information you need to answer the question, and so just retrieving more documents might solve that. But the big-picture idea is that your evaluation should do more than just provide relative rankings. It should help you identify the types of issues that are present, and it should also help you understand what strategies to implement to solve those issues.
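
As a toy illustration of that kind of triage, the three signals can be combined into a simple rule-of-thumb diagnosis; the thresholds and suggested remedies below are assumptions for illustration, not a prescribed rubric:

```python
# Toy triage logic combining the three reference-free signals into a suggested
# remediation. Thresholds and remedies are illustrative assumptions.
def diagnose(completeness: str, doc_relevance: float, hallucinated: bool) -> str:
    if completeness != "fully_addressed" and doc_relevance > 0.5 and not hallucinated:
        # Relevant grounding, no fabrication, but gaps in the answer:
        # likely missing information, so retrieve more documents.
        return "retrieve more documents"
    if doc_relevance < 0.2:
        # Little or no relevant grounding: fix query formulation / retrieval.
        return "improve retrieval or query generation"
    if hallucinated:
        # Grounding exists but the model strays from it: tighten grounding
        # instructions or add a post-generation hallucination check.
        return "constrain generation to the retrieved context"
    return "response looks healthy"
```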
Okay. So in conclusion, let me just quickly paint a picture of where we're heading with all of this, because this is not just about building the agents we've been building for the past couple of years, slapping evaluation on them, and then continuing to do the same thing. It's actually not about building better benchmarking. It's not about better monitoring. It's not about better evaluation. It's about creating AI systems that can continuously improve themselves. Imagine for a second that agents don't just retrieve information, but learn from the patterns of what information is outdated, what sources are unreliable, and what users need. They can also detect hallucinations mid-conversation and correct course, all without human intervention.

And this framework that we shared today, dynamic data sets, holistic evaluation, reference-free metrics, these are the building blocks for getting there. This is where we want to get with augmented AI. So, thank you so much for your time.