How to defend your sites from AI bots — David Mytton, Arcjet

Channel: aiDotEngineer

Published at: 2025-07-30

YouTube video id: Gi4V8viBGYQ

Source: https://www.youtube.com/watch?v=Gi4V8viBGYQ

Hi everyone. My name is David. I'm the founder of Arcjet. We provide a security SDK for developers, so everything I'm going to be talking to you about today is what we've been building for the last few years, and how you can do it yourself.
So, if you haven't had bots visiting
your website and felt the pain, then you
might be thinking: is this really a problem? Well, as you just heard in the introduction, almost 50% of web traffic today comes from automated clients, and that varies by industry. In gaming, almost 60% of all traffic is automated. And that's before the agent revolution has really kicked off. This isn't a new problem. It's been going on since the invention of the internet, and there are bots that you do want visiting your website, like Googlebot, but there are also a lot of malicious crawlers, and this causes problems.
The first incident you might experience is around expensive requests. Think through what happens on your website: if it's a static site, then maybe each request isn't doing much on your infrastructure. But if you're generating content from a database or reading dynamic content in some way, then each request is going to cost something, particularly if you're using a serverless platform and paying per request. If you have huge numbers of automated clients coming in and making hundreds of thousands of requests, this starts to build up as a cost problem, and as a capacity problem for your infrastructure.
And these clients can also be requesting
all the assets. So downloading large
files, that's going to start eating into
your bandwidth costs and eating into the
available resources you have to serve
legitimate users on your site.
This can show up as a denial of service
attack. So your service just might not
be available to others. And even the
largest website doesn't have infinite
resources.
Serverless means you don't have to think about capacity for the most part, but then you end up dealing with it through your bill instead.
This has been a problem for decades. So the real question is: is AI making this worse?
We see complaints in the media, with websites talking about the traffic they're getting, and there's an automatic assumption that this is AI. On the face of it there's no real evidence that that is the case. But when you start looking into the details of the kinds of requests these sites are seeing, then AI is making it worse. For instance, Diaspora, which is an open source online community, saw that 24% of their traffic was from GPTBot, which is OpenAI's crawler. And Read the Docs, an online documentation platform for code projects, found that by blocking all AI crawlers they reduced their bandwidth from 800 gigabytes a day to 200 gigabytes a day. Even Wikipedia is having this problem: up to 35% of their traffic is spent just serving automated clients, they're seeing this increase significantly, and they attribute it to AI crawlers. So AI is making this worse.
Scrapers are coming onto sites and
pulling down the content and they're not
behaving nicely. They're not doing it in
a gradual way and they're making
hundreds of thousands of requests and
just pulling down content without
following the rules.
In the old days, we had this idea of
good bots and bad bots. And the
challenge was always distinguishing
between them.
If you want your website to show up in a
search index like Google, then Google
has to know about your site. It has to visit and understand your site, but you
get a benefit from that because you're
going to appear in the search index and
you're going to get traffic as a result.
And so most people consider Google to be
a good bot.
And then there's the bad bots, which are
obviously bad. Scrapers coming to your
site, downloading all the images,
downloading all the content, downloading
files. It was very easy to understand
that those are the bad bots. But in the
middle, we've got these AI crawlers. And
sometimes they're good, sometimes
they're bad. And it sometimes depends on your philosophical approach to AI, but
also what you want from your website
because the first kinds of AI bots we
were seeing were for training just to
build up the models and in theory
there's no benefit to the site owner for
that because it's just being built into
the model. You're not necessarily
getting any traffic. But things have
started to change with multiple bots
coming from the different AI providers.
For instance, OpenAI has at least four different types of bots. The first one is the OpenAI search bot. This is the classic Googlebot-type crawler which will come to your site, understand what's going on, and index it, so that when someone makes a query in ChatGPT using the search functionality, you show up in OpenAI's index. In most cases you're going to want that. It does the same thing as Google: you're going to appear in a search index and you're probably going to get citations. That wasn't the case at the very beginning, but now you're getting citations, and this is becoming a real source of traffic for sites and services. People are getting signups as a result, so it's a win-win. It's the same as the old Google crawlers.
Then there's ChatGPT-User, and this is a little more nuanced. It's where ChatGPT may show up at your website as a result of a real-time query a user is making. Maybe you drop the actual URL into the chat and ask it to summarize the content, or it's a documentation link and you want to understand how to implement something, and it goes out and gets that content. It's not used for training, but it may not cite the response. If you've given it the URL, then perhaps you're a legitimate user, and so maybe you do want that, because it's actually your users making use of LLMs.
And then there's GPTBot, which is the one we saw taking up a huge amount of traffic on Wikipedia and Diaspora, and this is the original one used for training. It doesn't benefit you directly: you're being brought into the model and there's often no citation as a result. Those are the three crawler bots you might see on your site. And then what we're seeing more of now is the computer-use, Operator-type bots, which act on behalf of a real person, possibly with a web browser running in a VM, taking actions as an autonomous agent. This becomes challenging: do you want that or not? Maybe it's a legitimate use case. Maybe the agent is doing triage of your inbox; maybe Google would want that if it's Gmail. But if you've asked an agent to go out and buy 500 concert tickets so you can then sell them for a profit, that's probably something you don't want to allow. So being able to detect these is really challenging. The OpenAI crawlers identify themselves as such; you can verify that, and so you can allow or block them. But something like Operator just shows up as a Chrome browser, and it's much more challenging to detect.
So let's walk through some of the defenses you can implement to decide, as a site owner, how to control the kind of traffic coming to your site. The first one isn't really a defense, because it's entirely voluntary. Everyone's probably heard of robots.txt. It's how you can describe the structure of your website and tell different crawlers what you want them to do. You can allow or disallow, and you can control particular crawlers. This gives you a good understanding of your own site and lets you think through what you want to allow or disallow. But it's entirely voluntary.
Crawlers don't have to follow it, but the good ones will. Googlebot will follow it, as will all the search crawlers. OpenAI claims to follow it and does for the most part as well. But the types of bots that are causing these problems don't follow it. In some cases, they actually use it to find pages on your site that you've disallowed other bots from visiting, and deliberately go and get that content. Even so, this is a good place to start, because it helps you think through what you want different bots to be doing on your site.
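As a minimal sketch, a robots.txt that allows search indexing but opts out of AI training crawlers might look like this (the user agent tokens are publicly documented ones, but check each vendor's docs before relying on them):

```
# Allow traditional search crawlers
User-agent: Googlebot
Allow: /

# Opt out of AI training crawlers (voluntary -- only compliant bots honor this)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: allowed, but keep them out of admin pages
User-agent: *
Disallow: /admin/
```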
Every request that comes into your site will identify itself through the User-Agent HTTP header. It's just a string, a name the client gives itself, and you'll see it in your request logs. Because it's just a string, a client can set it to whatever it likes, but it's surprising how many will actually tell you who they are. You can use open source libraries to detect this and create rules around it. At Arcjet, we've got an open source project with several thousand different user agents that you can download and use to build your own rules to identify who you want to access your site. But it's just a string in an HTTP header, and it can be set to anything. The bad bots will just change it: they'll pretend to be Google or they'll pretend to be Chrome. So it is not always a good signal about who's actually visiting your site.
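As an illustrative sketch (the patterns and categories below are assumptions, not a maintained list), a user agent check might look like this:

```typescript
// Minimal user agent classification sketch. The patterns are illustrative
// only; in practice you'd load a maintained list of known bot user agents
// and keep it up to date.
const KNOWN_BOT_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "Googlebot", pattern: /Googlebot\/\d/ },
  { name: "GPTBot", pattern: /GPTBot/ },
  { name: "ChatGPT-User", pattern: /ChatGPT-User/ },
];

function classifyUserAgent(userAgent: string | undefined): string | null {
  if (!userAgent) return null; // a missing UA header is itself suspicious
  const match = KNOWN_BOT_PATTERNS.find(({ pattern }) => pattern.test(userAgent));
  return match ? match.name : null;
}

// Example: decide what to do with a request based on its declared identity.
const bot = classifyUserAgent("Mozilla/5.0 AppleWebKit/537.36; GPTBot/1.2");
if (bot === "GPTBot") {
  console.log("Declared AI training crawler -- apply your GPTBot policy");
}
```

Remember this only tells you what the client claims to be; verification comes next.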
So the next thing you can do is verify the claim. If a request is made to your site and it claims to be Apple's crawler, Bing, Google, or OpenAI, all of these services support verification. You can look at the source IP address and query those services using a reverse DNS lookup to check whether it is actually who it claims to be. So if you see a request claiming to come from Google, you can ask Google whether it's actually Google, and they'll give you a response back saying whether it is or not. This makes it quite straightforward to use the combination of the user agent string plus IP verification to check whether the good bots really are the ones visiting your site, and to set up some simple rules to allow the crawlers that you actually want.
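Here's a sketch of forward-confirmed reverse DNS verification for Googlebot in Node.js (Google documents the googlebot.com/google.com suffixes; other vendors use different domains, so check their docs):

```typescript
import { promises as dns } from "node:dns";

// Forward-confirmed reverse DNS: the IP must reverse-resolve to an
// allowed hostname, and that hostname must resolve back to the same IP.
async function isVerifiedGooglebot(ip: string): Promise<boolean> {
  try {
    const hostnames = await dns.reverse(ip); // IP -> hostname(s)
    for (const hostname of hostnames) {
      const allowed =
        hostname.endsWith(".googlebot.com") || hostname.endsWith(".google.com");
      if (!allowed) continue;
      const addresses = await dns.resolve(hostname); // hostname -> IP(s)
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // Reverse lookup failed -- treat as unverified.
  }
  return false;
}

// Usage: only trust a "Googlebot" user agent if the source IP checks out.
isVerifiedGooglebot("66.249.66.1").then((ok) =>
  console.log(ok ? "Verified Googlebot" : "Spoofed or unverified"),
);
```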
Things start to get a bit more complicated if those signals don't provide you with sufficient information. Bot detection is not 100% accurate, so you have to build up these layers.
And so the next thing you can do is look at IP addresses. The idea is to build up a pattern of what is normal for each IP address, and not just a single IP address but the different IP address ranges: how they associate with different networks and network operators, whether the request is coming from a data center or not, and the country-level information. You can get this from various databases. You have to pay for access to most of them, but there are also some free APIs you can use to query the metadata associated with a particular IP address; MaxMind and IPinfo are two of the more popular ones. You want to be looking at things like: where is the traffic coming from? What is its association with the network? Is it coming from a VPN or a proxy? Is it a residential or mobile IP address?
Last year, 12% of all bot traffic that hit the Cloudflare network came from the AWS network. So you can ask yourself: are the normal users of our site and application going to come from a data center? Maybe, if you're allowing crawlers on your site, that's expected. But if you have a signup form that you expect only humans to submit, then it's unlikely that a request coming from a data center IP address is traffic you want to accept.
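A minimal sketch of using IP metadata in a decision, assuming a lookup service that returns a country and an ASN/organisation string (the endpoint shown, the response shape, and the `KNOWN_CLOUD_ASNS` list are assumptions for illustration; check your provider's API docs):

```typescript
// Hypothetical shape of an IP metadata lookup response.
interface IpMetadata {
  country: string; // ISO country code, e.g. "US"
  org: string;     // ASN + organisation, e.g. "AS16509 Amazon.com, Inc."
}

// Illustrative list only -- real cloud/hosting ASN lists are much longer
// and need to be kept up to date.
const KNOWN_CLOUD_ASNS = ["AS16509", "AS14618", "AS15169", "AS8075"];

// Query a metadata API for the IP. ipinfo.io is one example provider;
// the fields used here are assumptions about its response.
async function lookupIp(ip: string): Promise<IpMetadata> {
  const res = await fetch(`https://ipinfo.io/${ip}/json`);
  if (!res.ok) throw new Error(`IP lookup failed: ${res.status}`);
  return (await res.json()) as IpMetadata;
}

// Example policy: a signup form should not be submitted from a data center.
async function isLikelyDataCenter(ip: string): Promise<boolean> {
  const meta = await lookupIp(ip);
  const asn = meta.org.split(" ")[0]; // "AS16509 Amazon..." -> "AS16509"
  return KNOWN_CLOUD_ASNS.includes(asn);
}
```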
The challenge with geo data, like blocking a single country for instance, is that it's notoriously inaccurate and has become more inaccurate over time as people use satellite and 5G cell phone connectivity, because the IP address will be geolocated to the owner of the IP rather than to the user of it. And even when the database says an IP address belongs to a residential network, there are proxy services you can simply buy access to that route traffic through residential networks so it appears to come from a home ISP or a mobile device. So you can't always trust these. You have to build up signals and build your own database to understand where the traffic is coming from and the likelihood that it's an automated client.
CAPTCHAs are the standard thing we've been using for decades to try to distinguish between humans and automated clients: solving puzzles, moving things around on the screen. But it's becoming increasingly easy for AI to solve them. Putting them into an LLM, or downloading the audio version and transcribing it, can be done in a couple of seconds, and it's trivial and cheap to breach these kinds of defenses.
There are newer approaches. Proof of work, which comes from the crypto side of things, means you require a client to perform a certain number of calculations and provide the answer to a puzzle before it can access the resource. This usually takes a certain amount of time and costs CPU time. On an individual basis, on your laptop or your phone, it might take a second or two to calculate, and it makes no real difference to an individual. But if you have a crawler that's going to tens of thousands or millions of websites and has to solve this puzzle every single time, it becomes very expensive. So deploying these proof-of-work options on your website can be a way to deter those crawlers.
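Here's a minimal hash-based proof-of-work sketch illustrating the cost asymmetry; real deployments are more involved, but the idea (cheap to verify, costly to solve at scale) is the same:

```typescript
import { createHash, randomBytes } from "node:crypto";

// The server issues a random challenge; the client must find a nonce such
// that sha256(challenge + nonce) starts with `difficulty` hex zeros.

function issueChallenge(): string {
  return randomBytes(16).toString("hex");
}

function solve(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(challenge + nonce).digest("hex");
    if (digest.startsWith(prefix)) return nonce; // found a valid nonce
  }
}

function verify(challenge: string, nonce: number, difficulty: number): boolean {
  const digest = createHash("sha256").update(challenge + nonce).digest("hex");
  return digest.startsWith("0".repeat(difficulty));
}

// Usage: a few hex zeros is fast for one visitor but adds up across
// millions of pages; difficulty can be scaled with suspicion.
const challenge = issueChallenge();
const nonce = solve(challenge, 4);
console.log("valid:", verify(challenge, nonce, 4));
```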
But then it becomes a question of incentives. If you're crawling millions of websites, then maybe that is a good defense. But go back to that ticket example: if it costs someone a couple of dollars to solve a CAPTCHA or a proof-of-work puzzle, but they're then going to sell a ticket for $200 or $300, the profit is still there. So these may not be a defense against certain types of attacks. You can scale the difficulty: if you bring in all these different signals and see that something is coming from an unverified IP address and has suspicious characteristics, then maybe you give it a harder puzzle. But then you start to have accessibility problems, and I'm sure we've all seen those really annoying CAPTCHAs that you can't solve and have to keep refreshing. That becomes a problem as well.
There are a couple of interesting open source projects that implement these. Anubis is a good one, as are go-away and Nepenthes. These are all proxies that you can install on a Kubernetes cluster or put in front of your application. You run them yourself, and they implement these proof-of-work challenges and put them in front of the clients they think are suspicious.
There are also some emerging standards around introducing signatures into requests, because what we're trying to do is prove that a particular client is who it says it is and is who you want on the website. Cloudflare has suggested this idea of HTTP message signatures for automated clients, where every request includes a cryptographic signature which you can then verify very quickly to understand which client is coming to your site. This was only announced a couple of weeks ago, so it's still being developed, and there are questions around whether it's any better than just verifying the IP address, but it's a way of verifying automated clients. And then a couple of years ago, Apple announced Private Access Tokens, based on what they call Privacy Pass, which allow website owners to verify that a request is coming from a browser owned by an iCloud subscriber. This has been implemented across all Apple devices. If you're using Safari, it's on, and it will reduce the number of CAPTCHAs you might see, because the site can verify that you're actually a paying iCloud subscriber. But it's had limited adoption elsewhere. Not many sites are using it, and it's only in the Apple ecosystem, even though it's almost an approved standard.
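To illustrate the core idea behind signed requests (this is not the HTTP Message Signatures wire format or Cloudflare's proposal; the signature base, header handling, and key discovery are all simplified assumptions here):

```typescript
import { createPublicKey, verify } from "node:crypto";

// The bot operator publishes a public key; each request carries a signature
// over some canonical request material, which the site verifies.
function verifyBotSignature(
  signatureBase: string, // e.g. method + path + date, canonicalised
  signatureB64: string,  // signature value from a request header, base64
  publicKeyPem: string,  // fetched from the bot operator's published location
): boolean {
  const key = createPublicKey(publicKeyPem);
  // Ed25519 verification: the algorithm argument is null for Ed25519 keys.
  return verify(
    null,
    Buffer.from(signatureBase, "utf8"),
    key,
    Buffer.from(signatureB64, "base64"),
  );
}
```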
And then we need to implement fingerprints as well. Fingerprinting looks at the network request to generate a hash that identifies the client, because it's quite trivial to change the IP address your requests come from. You'll often see crawlers using banks of tens or hundreds of thousands of different IP addresses, particularly with IPv6, which means blocking based just on an IP address isn't sufficient. But the client characteristics stay the same across multiple requests, and you can build up a fingerprint of that. There's the open source JA4 hash, which is based on the TLS fingerprint, looking at the network level and the configuration of the TLS handshake, and then there's a proprietary version for HTTP, looking at the headers sent by a client and the characteristics of an HTTP request, to build up a fingerprint. You can then use those fingerprints as part of your block rules. If you see hundreds of thousands of requests coming from a single fingerprint, you can just block that fingerprint, regardless of how many IP addresses it's spread across.
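As a sketch of the idea (this is not the JA4/JA4H algorithm; real fingerprints also use the TLS ClientHello, header order, HTTP version, and more):

```typescript
import { createHash } from "node:crypto";

// Illustrative HTTP-level fingerprint: hash stable characteristics of the
// request rather than its IP address.
function httpFingerprint(headers: Record<string, string | undefined>): string {
  const parts = [
    headers["user-agent"] ?? "",
    headers["accept"] ?? "",
    headers["accept-language"] ?? "",
    headers["accept-encoding"] ?? "",
    Object.keys(headers).sort().join(","), // which headers are present at all
  ];
  return createHash("sha256").update(parts.join("|")).digest("hex").slice(0, 16);
}

// Two requests from different IPs but the same client configuration produce
// the same fingerprint, so rules can key off it instead of the IP.
const fp = httpFingerprint({
  "user-agent": "python-requests/2.31.0",
  "accept": "*/*",
  "accept-encoding": "gzip, deflate",
});
console.log("fingerprint:", fp);
```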
Rate limiting is then used in conjunction with a fingerprint. Once you can fingerprint the client, you can apply quotas or a limit to it, and the key you choose is really important. You can't just rate limit on an IP address, because people share IPs, IPs change all the time, and malicious crawlers can simply change them. Keying off the user or session ID is a good way to do it if the user is logged in, or if you've got the fingerprint, the JA4 hash, you can implement rate limits on that.
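A minimal in-memory fixed-window rate limiter keyed by fingerprint (or user/session ID) might look like this; it's a sketch only, and production systems would use a shared store such as Redis and a sliding window or token bucket:

```typescript
const WINDOW_MS = 60_000;  // 1 minute window
const MAX_REQUESTS = 100;  // allowed requests per key per window

const counters = new Map<string, { windowStart: number; count: number }>();

function allowRequest(key: string, now = Date.now()): boolean {
  const entry = counters.get(key);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(key, { windowStart: now, count: 1 }); // start a new window
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}

// Usage: key off the fingerprint rather than the IP address.
const key = "fp:3f1a9c42d0b7e615"; // hypothetical fingerprint value
console.log(allowRequest(key) ? "allowed" : "rate limited");
```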
So those are the eight defenses. robots.txt is where you start. It's not where you finish, though, because it's not going to prevent all the bots; it's a voluntary standard. It's where you start because it helps with the good bots. At the very least, you need to be looking at user agents. There are various open source options for doing that and setting up rules, and then verifying that the clients you actually want on your site are the ones actually making the requests. That gets you most of the way, and for most sites that will deal with everything you need. But for the more popular sites, or sites with particularly interesting resources, or things that people might want to buy in large numbers or that are in restricted quantities, you need to go further: looking at IP reputation, setting up proof of work, considering these experimental HTTP signatures, and certainly the fingerprinting side of things, which is where most people land, in combination with rate limits.
You can implement all of these yourself in code. That's what we do at Arcjet. There's a much more detailed write-up of this talk on the blog I published earlier today, so if you have a look at blog.arcjet.com, there's a full write-up with much more detailed examples. I'm happy to answer any questions via email, and we also have a booth down in the expo. But thank you very much.