How to defend your sites from AI bots — David Mytton, Arcjet
Channel: aiDotEngineer
Published at: 2025-07-30
YouTube video id: Gi4V8viBGYQ
Source: https://www.youtube.com/watch?v=Gi4V8viBGYQ
Hi everyone. My name is David. I'm the founder of Arcjet. We provide a security SDK for developers, so everything I'm going to talk about today is what we've been building for the last few years, but how you can do it yourself.

If you haven't had bots visiting your website and felt the pain, you might be thinking: is this really a problem? As you just heard in the introduction, almost 50% of web traffic today is automated clients, and that varies by industry. In gaming, almost 60% of all traffic is automated. And that's before the agent revolution has really kicked off. This isn't a new problem. It's been going on since the invention of the internet. There are bots that you want to visit your website, like Googlebot, but there are also a lot of malicious crawlers, and this causes a problem.

The first incident you might experience is around expensive requests. Think through what happens on your website. If it's a static site, then maybe a request doesn't do much on your infrastructure. But if you're generating any content from a database, or reading dynamic content in some way, then each request is going to cost something, particularly if you're using a serverless platform and paying per request. If you have huge numbers of automated clients coming in and making hundreds of thousands of requests, this builds up as a cost problem, and as a capacity problem for your infrastructure. These clients can also request all your assets, so downloading large files starts eating into your bandwidth costs and into the resources available to serve legitimate users on your site. This can show up as a denial of service attack: your service just might not be available to others. Even the largest website doesn't have infinite resources.
Serverless means you don't have to think about capacity for the most part, but you end up handling it in your bill instead. This has been a problem for decades, so the real question is: is AI making it worse? We see complaints in the media, websites talking about the traffic they're getting, and there's an automatic assumption that this is AI. On the face of it, there's no real evidence that that's the case. But when you look into the details of the kind of requests these sites are seeing, AI is making it worse. For instance, Diaspora, an open source online community, saw that 24% of their traffic was from GPTBot, which is OpenAI's crawler. Read the Docs, an online documentation platform for code projects, found that by blocking all AI crawlers they reduced their bandwidth from 800 gigabytes a day to 200 gigabytes a day. Even Wikipedia is having this problem: up to 35% of their traffic is spent just serving automated clients, they're seeing that increase significantly, and they attribute it to AI crawlers.

So AI is making this worse. Scrapers are coming onto sites and pulling down content, and they're not behaving nicely. They're not doing it gradually; they're making hundreds of thousands of requests and pulling down content without following the rules. In the old days, we had this idea of good bots and bad bots, and the challenge was always distinguishing between them. If you want your website to show up in a search index like Google, then Google has to know about your site; it has to visit and understand it. But you get a benefit from that, because you appear in the search index and get traffic as a result. So most people consider Googlebot to be a good bot. And then there are the bad bots, which are obviously bad.
Scrapers coming to your site, downloading all the images, all the content, all the files: it was very easy to understand that those are the bad bots. But in the middle we've got these AI crawlers, and sometimes they're good, sometimes they're bad. That depends partly on your philosophical approach to AI, but also on what you want from your website, because the first AI bots we saw were for training, just building up the models, and in theory there's no benefit to the site owner. The content is built into the model and you're not necessarily getting any traffic. But things have started to change, with multiple bots coming from the different AI providers.

For instance, OpenAI has at least four different types of bots. The first is OAI-SearchBot. This is a classic Googlebot-style crawler that visits your site, understands what's going on, and indexes it, so that when someone makes a query in ChatGPT using the search functionality, you show up in OpenAI's index. In most cases you're going to want that. It does the same thing as Google: you appear in a search index, and you're probably going to get citations. That wasn't the case at the very beginning, but now you are, and this is becoming a real source of traffic for sites and services. People are getting signups as a result, so it's a win-win, the same as the old Google crawlers.

Then there's ChatGPT-User, which is a little more nuanced. This is where ChatGPT may show up at your website as a result of a real-time query a user is making. Maybe you dropped the URL into the chat and asked it to summarize the content, or it's a documentation link and you want to understand how to implement something, so it goes out and fetches that content. It's not used for training, but it may not cite the response.
But if you've given it the URL, then perhaps you're a legitimate user, and maybe you do want that, because it's your own users making use of LLMs. And then there's GPTBot, which is the one we saw taking up a huge amount of traffic on Wikipedia and Diaspora. This is the original one used for training. It doesn't benefit you directly: you're being brought into the model, and there's often no citation as a result. Those are the three crawler bots you might see on your site.

What we're seeing more of now is the computer-use, Operator-type bots, which act on behalf of a real person, possibly with a web browser running in a VM, taking actions as an autonomous agent. This becomes challenging: do you want that or not? Maybe it's a legitimate use case. Maybe the agent is doing triage of your inbox; maybe Google would want that if it's Gmail. But if you've asked an agent to go out and buy 500 concert tickets so you can sell them for a profit, that's probably something you don't want to allow. Being able to detect these is really challenging. The OpenAI crawlers identify themselves as such; you can verify that, and so you can allow or block them. But something like Operator just shows up as a Chrome browser, and it's much more difficult to detect.

So let's walk through some of the defenses you can implement to decide, as a site owner, how to control the kind of traffic coming to your site. The first one isn't really a defense, because it's entirely voluntary. Everyone's probably heard of robots.txt. It's how you describe the structure of your website and tell different crawlers what you want them to do. You can allow or disallow, and you can target particular crawlers.
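As a minimal sketch, a robots.txt that welcomes search crawlers but opts out of training crawlers might look like this. The user agent tokens shown are the ones these operators publish; check each operator's documentation for the current list, and remember the whole file is only advisory:

```
# Allow search crawlers that send you traffic
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Disallow crawlers used for model training
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everything else: keep non-public areas out of indexes
User-agent: *
Disallow: /admin/
```

Note that a Disallow rule here is also a signpost: a badly behaved crawler can read it as a list of interesting paths.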
This gives you a good understanding of your own site and helps you think through the steps you want to take to allow or disallow. But it's entirely voluntary. Crawlers don't have to follow it, though the good ones will. Googlebot follows it, as do all the search crawlers. OpenAI claims to follow it, and does for the most part as well. But the types of bots causing these problems don't follow it, and in some cases they actually use it to find the pages you've disallowed other bots from visiting and deliberately go and fetch that content. Even so, this is a good place to start, because it helps you think through what you want different bots to do on your site.

Every request that comes into your site identifies itself through the User-Agent header. It's just a string: a name the crawler gives itself, which you'll see in your request logs. Because it's just a string, a client can set it to whatever it likes, but it's surprising how many will actually tell you who they are. You can use open source libraries to detect this and create rules around it. At Arcjet, we've got an open source project with several thousand different user agents that you can download and use to build your own rules about who can access your site. But again, it's just a string in an HTTP header, and the bad bots will simply change it. They'll pretend to be Google or Chrome, so it's not always a good signal about who's actually visiting your site. So the next thing you can do is verify: if a request comes to your site claiming to be Apple's crawler, Bing, Google, or OpenAI, all of these services support verification.
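The basic user agent check itself can be sketched in a few lines. The patterns below are illustrative examples only; in practice you'd load a maintained list (such as an open source user agent dataset) rather than hand-maintain one:

```python
import re

# Illustrative denylist of bot user agent substrings (not a maintained list).
BOT_PATTERNS = [
    re.compile(r"GPTBot", re.I),
    re.compile(r"CCBot", re.I),
    re.compile(r"Bytespider", re.I),
]

def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known bot pattern.

    The header is a client-supplied string, so this only catches bots
    that identify themselves honestly.
    """
    return any(p.search(user_agent) for p in BOT_PATTERNS)
```

This is a first filter, not proof of anything: a scraper that lies in its User-Agent sails straight through, which is why the verification step comes next.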
You can look at the source IP address and query those services using a reverse DNS lookup to check whether the request is actually who it claims to be. So if you see a request claiming to come from Google, you can ask Google whether it's actually Google, and they'll give you an answer. This makes it quite straightforward to use the combination of the user agent string plus IP verification to check that the good bots visiting your site are genuine, and to set up some simple rules to allow the crawlers you actually want.

Things start to get more complicated if those signals don't provide enough information. Bot detection is not 100% accurate, so you have to build up layers. The next thing you can do is look at IP addresses. The idea is to build up a pattern of what's normal for each IP address: not just a single address, but the IP ranges, how they associate with different networks and network operators, whether the request is coming from a data center, and the country-level information. You can get this from various databases. You have to pay for access to most of them, but there are also free APIs you can use to query the metadata associated with an IP address; MaxMind and IPinfo are two of the more popular providers. You want to look at things like where the traffic is coming from and its association with the network. Is it coming through a VPN or a proxy? Is it a residential or mobile IP address? Last year, 12% of all bot traffic that hit the Cloudflare network came from the AWS network. So you can ask yourself: are the normal users of our site and application going to come from a data center? If you're allowing crawlers on your site, then maybe that's expected.
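The reverse DNS verification step above is usually done as forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname is in the operator's published domain, then resolve the hostname back and confirm it matches the original IP. A sketch, with the resolvers injectable so it can be tested without the network (the domain suffixes are the ones these operators document, but treat them as examples, not an exhaustive list):

```python
import socket

# Domains each operator publishes for its crawlers (illustrative subset).
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_crawler(ip: str, claimed: str,
                   reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                   forward=socket.gethostbyname) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler."""
    suffixes = CRAWLER_DOMAINS.get(claimed)
    if not suffixes:
        return False
    try:
        hostname = reverse(ip)             # IP -> hostname
        if not hostname.endswith(suffixes):
            return False                   # not in the operator's domain
        return forward(hostname) == ip     # hostname -> IP must round-trip
    except OSError:                        # lookup failed: treat as unverified
        return False
```

The forward confirmation matters: anyone who controls their own reverse DNS can make an IP resolve to `anything.googlebot.com`, but they can't make Google's DNS resolve that hostname back to their IP.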
But if you have a signup form that you expect only humans to submit, then it's unlikely that a request coming from a data center IP address is traffic you want to accept. The challenge with geo data, like blocking a single country, is that it's notoriously inaccurate, and it has become more inaccurate over time as people use satellite and 5G cellular connectivity, because the IP address gets geolocated to the owner of the IP rather than the user of it. And even when the database says an IP address belongs to a residential network, there are proxy services you can simply buy access to that will route your traffic through residential networks, making it appear to come from a home ISP or a mobile device. So you can't always trust these; you have to build up signals and build your own database to understand where the traffic is coming from and how likely it is to be an automated client.

CAPTCHAs are the standard thing we've been using for decades to try to distinguish humans from automated clients: solving puzzles, moving things around on the screen. But it's becoming increasingly easy for AI to solve those. Putting them into an LLM, or downloading the audio version and transcribing it, can be done in a couple of seconds, and it's trivial and cheap to breach these kinds of defenses.

There are newer approaches. Proof of work, which comes from the crypto side of things, requires a computer to do a certain number of calculations and provide the answer to a puzzle before it can access the resource. This takes time and costs CPU. On an individual basis, on your laptop or phone, it might take a second or two to calculate, and it makes no real difference to an individual.
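The asymmetry behind proof of work is easy to see in a hashcash-style sketch: the client brute-forces a nonce until a hash has enough leading zeros, while the server verifies with a single hash. This is a simplified illustration of the idea, not any particular project's scheme:

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce so sha256(challenge:nonce) starts
    with `difficulty` zero hex digits. Expected work grows 16x per digit."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so verification is effectively free."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

A second or two of this per page is invisible to one visitor but adds up fast for a crawler hitting millions of pages, which is exactly the incentive argument above.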
But if a crawler is visiting tens of thousands or millions of websites and has to solve this puzzle every single time, it becomes very expensive. So deploying these proof-of-work options on your website can be a way to deter those crawlers. But then it becomes a question of incentives. If someone is crawling millions of websites, maybe it's a good defense. But go back to that ticket example: if it costs someone a couple of dollars to solve a CAPTCHA or a proof-of-work puzzle, but they're then going to sell a ticket for $200 or $300, the profit is still there, so these may not be a defense against certain types of attack. You can scale the difficulty: if all these signals show something coming from an unverified IP address with suspicious characteristics, maybe you give it a harder puzzle. But then you start to have accessibility problems. I'm sure we've all seen those really annoying CAPTCHAs you can't solve and have to keep refreshing; that becomes a problem as well.

There are a couple of interesting open source projects that implement these. Anubis is a good one, along with go-away and Nepenthes. These are all proxies that you can install on a Kubernetes cluster or put in front of your application. You run them yourself, and they present proof-of-work challenges to the users they think are suspicious.

There are also some emerging standards around introducing signatures into requests, because what we're trying to do is prove that a particular client is who it says it is and is who you want on the website. Cloudflare has suggested the idea of HTTP message signatures for automated clients, where every request includes a cryptographic signature you can verify very quickly, so you can understand which client is coming to your site.
This was only announced a couple of weeks ago, so it's still being developed. There are questions about whether it's any better than just verifying the IP address, but it's a way of verifying automated clients. And a couple of years ago, Apple announced Private Access Tokens, built on what's called Privacy Pass, which lets website owners verify that a request is coming from a browser owned by an iCloud subscriber. This has been implemented across all Apple devices. If you're using Safari, it's on, and it will reduce the number of CAPTCHAs you see, because sites can verify that you're a paying iCloud subscriber. But it's had limited adoption elsewhere: not many sites use it, and it's only on the Apple ecosystem, even though it's almost an approved standard.

And then we have to implement fingerprints as well. Fingerprinting looks at the network request to generate a hash that identifies the client, because it's trivial to change the IP address your requests come from. You'll often see crawlers using banks of tens or hundreds of thousands of different IP addresses, particularly with IPv6, which means rules based just on an IP address aren't sufficient. But the client stays the same. The client's characteristics stay the same across multiple requests, and you can build up a fingerprint from that. There's the open source JA4 hash, which is based on the TLS fingerprint, looking at the network level and the configuration of SSL, and there's a proprietary version for HTTP that looks at the headers a client sends and the characteristics of the HTTP request. You can then use those fingerprints in your block rules. So you could look at all the hundreds of thousands of requests coming from a single fingerprint.
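To make the idea concrete, here's a deliberately simplified HTTP-layer fingerprint: hash the shape of the request (which headers are present and in what order, plus a few stable values) rather than the IP. Real fingerprints like JA4 work at the TLS layer and are far more robust; this is just an illustration of why rotating IPs doesn't help a crawler whose client looks the same every time:

```python
import hashlib

def http_fingerprint(headers: dict[str, str]) -> str:
    """Hash the shape of a request, not its source IP.

    A client rotating through thousands of IPs usually keeps the same
    header names, order, and values, so the fingerprint stays stable.
    (Simplified sketch; not the JA4 algorithm.)
    """
    parts = [
        ",".join(h.lower() for h in headers),      # header names + order
        headers.get("Accept-Language", ""),
        headers.get("Accept-Encoding", ""),
        headers.get("User-Agent", ""),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

Two requests from the same client produce the same hash even from different IPs, so the hash itself becomes the key for block rules and rate limits.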
You could just block that fingerprint regardless of how many IP addresses it spans. And then rate limiting is used in conjunction with a fingerprint. Once you can fingerprint the client, you can apply quotas or a limit to it, and the key you choose is really important. You can't just rate limit on an IP address, because people's IPs change all the time, and malicious crawlers can change theirs deliberately. Keying off a user session ID is a good approach if the user is logged in, and otherwise you can apply rate limits on the fingerprint, the JA4 hash.

So those are the eight defenses. robots.txt is where you start. It's not where you finish, because it's a voluntary standard and won't prevent all the bots, but it helps with the good ones. At the very least, you need to look at user agents; there are various open source options for detecting them and setting up rules. Then verify that the clients claiming to be crawlers you actually want on your site are the ones really making the requests. That gets you most of the way, and for most sites it will deal with everything you need. But for more popular sites, or sites with particularly interesting resources, things people might want to buy in large numbers or that are in restricted quantities, you need to go further: looking at IP reputation, setting up proof of work, considering the experimental HTTP signatures, and certainly fingerprinting, which is where most people land, in combination with rate limits. You can implement all of this yourself in code; that's what we do at Arcjet. There's a much more detailed write-up of this talk, with more detailed examples, on the blog I published earlier today at blog.arcjet.com.
I'm happy to answer any questions via email, and we also have a booth down in the expo. Thank you very much.