Guarding My Git Forge Against AI Scrapers

2025-12-02 (Tue) ; by ~lymkwi (lux) ; 5762 words ; about 29 minutes ;
updated 2026-03-19 (Thu);
#git forge; #forgejo; #nginx; #scrapers;

A summary of the techniques in place to protect my git forge

In August 2024, one of my roommates and partners messaged the apartment group chat, saying she noticed the internet was slow again at our place, and my forgejo was unable to render any page in under 15 seconds.

i investigated, thinking it would be a trivial little problem to solve. Soon enough, however, i would uncover hundreds of thousands of queries a day from thousands of individual IPs, fetching seemingly-random pages in my forge every single day, all the time.

This post summarizes the practical issues that arose as a result of the onslaught of scrapers eager to download millions of commits off of my forge, and the measures i put in place to limit the damage.

# Why the forge?

In the year 2025, on the web, everything is worth being scraped. Everything that came out of the mind of a human is susceptible to be snatched under the vastest labor theft scheme in the history of mankind. This very article, the second it gets published in any indexable page, will be added to countless datasets meant to train foundational large-language models. My words, your words, have contributed infinitesimal shifts of neural-network weights underpinning the largest, most grotesque accumulation of wealth seen over the lifetime of my parents, grandparents, and their grandparents themselves.

Oh, and forges have a lot of commits. See, if you have a publicly exposed repository, every file in every folder for every commit gets its own linked page. Add other options, such as a git blame on a file, and multiply it by the number of files and commits. Add the raw download link, also multiplied by the number of commits.

Say, hypothetically, you have a linux repository available, and only with all the commits in the master branch up to the v6.17 tag from 2025-09-18. That’s 1,383,738 commits in the range 1da177e4c3f4..e5f0a698b34e. How many files is that? Well:

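count=0
# Walk every commit in the range and add up the size of its tree.
while read -r rev; do
    # -r recurses into subtrees; -t also lists the directories themselves
    point=$(git ls-tree -tr "$rev" | wc -l)
    count=$(( count + point ))
    printf "[%s] %s: %d (tot: %d)\n" "$(git log -1 --pretty=tformat:%cs "$rev")" "$rev" "$point" "$count"
done < <(git rev-list "1da177e4c3f4..e5f0a698b34e")
printf "Total: %d\n" "$count"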

i ran this on the 100 commits before v6.17. With git ls-tree -tr $rev, you get both files and directories counted. If you replace it with git ls-tree -r $rev, only files are counted. i got 72024729 files, and 76798658 files and directories. Running on the whole history of Linux’s master branch yields 78,483,866,182 files, and 83,627,462,277 files and directories.

Now, for a ballpark estimate of the number of pages that can be scraped if you have a copy of Linux, apply the formula:

(Ncommits * Nfiles) * 2 + (Ncommits * Nfilesandfolders) * 2 + Ncommits * 3

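That is, applied to my hypothetical Linux repository, where the Ncommits * Nfiles and Ncommits * Nfilesandfolders products are the totals summed over every commit computed above: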

78483866182 * 2 + 83627462277 * 2 + 1383738 * 3 = 324,226,808,132 pages

The first * 2 accounts for the fact that every file of every commit can be scraped raw, and git-blame’d. The second part of the formula considers every single file or folder page. The third part accounts for the fact that every file of every commit can be diffed with its version of every commit (in theory). The final component considers every commit summary page.
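The arithmetic above can be double-checked in any shell with 64-bit integer arithmetic. This is only a sketch: estimate_pages is a made-up helper name, and its three arguments are the totals quoted in this section.

```shell
# Ballpark number of scrapable pages for one repository.
#   $1: files, summed over every commit
#   $2: files and directories, summed over every commit
#   $3: number of commits
estimate_pages() {
    echo $(( $1 * 2 + $2 * 2 + $3 * 3 ))
}

# Totals for the hypothetical Linux repository
estimate_pages 78483866182 83627462277 1383738   # prints 324226808132
```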

That gives, for me, 324 billion 226 million 808 thousand and 132 pages that can be scraped. From a single repository. Assume that every scraper agent that enters one of these repositories will also take note of every other link on the page, and report it so that other agents can scrape them. These scrapers effectively act like early 2000s web spiders that crawled the internet to index it, except they do not care about robots.txt, and they will absolutely keep scraping new links again and again with no strategy to minimize the cost on you, as a host.

# The Cost of Scraping

As i am writing the original draft of this section, the longer-term measures i put in place have been removed, so i could gather up-to-date numbers to display how bad the situation is.

i pay for the electricity that powers my git forge. Okay, actually, one of my roommates does, but we put it on the calc sheet where we keep track of who pays what (when we remember).

At the time i began fighting scrapers, my git forge ran from an old desktop computer plugged in in my living room. Now, it is in our home’s rackable server in a virtual machine. i never got to measure differences in power consumption when we got scraped or not scraped on the desktop machine, but i did on the rackable server. If memory serves me right, stopping the wave of scrapers reduced the power draw of the server from ~170W to ~150W.

Right now, with all the hard drives in that server spinning, and every protection off, we are drawing 200W from the power grid on that server. Constantly. By the end of this experiment, me and my roommates will have computed that the difference in power usage caused by scraping costs us ~60 euros a year.
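A constant extra power draw converts to yearly energy with plain arithmetic. The sketch below is only an illustration, not the computation mentioned above: annual_kwh is a made-up helper name, it assumes the extra draw never varies, and the 50W example is simply the gap between the 200W and ~150W figures quoted in this section.

```shell
# Extra energy per year, in whole kWh, for a constant extra draw given in watts.
annual_kwh() {
    echo $(( $1 * 24 * 365 / 1000 ))
}

annual_kwh 50   # prints 438
annual_kwh 20   # prints 175 (the ~170W to ~150W difference)
```

Multiplying the result by your own price per kWh gives a yearly cost.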

Another tied cost is that the VM that runs the forge is figuratively suffocating from the amount of queries. Not all queries are born equal, either: requests to see the blame of a file or a diff between commits incur a worse cost than just rendering the front page of a repository. The last huge wave of scraping left my VM at 99+% usage of 4 CPU cores and 2.5GiB of RAM, whereas the usual levels i observe are closer to 4% usage of CPUs, and an oscillation between 1.5GiB and 2GiB of RAM.

As i’m writing this, the VM running forgejo eats 100% of 8 CPU cores.

Additionally, the networking cost is palpable. Various monitoring tools let me see the real-time traffic statistics in our apartment. Before i put the first measures in place to thwart scraping, we could visibly see the traffic coming out of the desktop computer running my forge and out to the internet. My roommates’ complaints that it slowed down the whole internet here were in fact founded: when we had multiple people watching live streams or doing pretty big downloads, they were throttled by the traffic out of the forge.¹

The egress data rate of my forge’s VM is at least 4MBps of data (32Mbps). Constantly.

Finally, the human cost: i have spent entire days behind my terminals trying to figure out 1) what the fuck was going on and 2) what the fuck to do about it. i have had conversations with other people who self-host their infrastructure, desperately trying to figure out workable solutions that would not needlessly impact our users. And the funniest detail is: that rackable server is in the living room, directly in front of my bedroom door. It usually purrs like an adorable cat, but, lately, it’s been whirring louder and louder. i can hear it. when i’m trying to sleep.

# Let’s do some statistics.

i was curious to analyze the nginx logs to understand where the traffic came from and what shape it took.

As a case study, we can work on /var/log/nginx/git.vulpinecitrus.info/ from 2025-11-14 to 2025-11-19. Note that on 2025-11-15 at 18:27 UTC, i stopped the redirection of new agents into the Iocaine crawler maze (see below). At 19:15 UTC, i removed the nginx request limit zone from the /Lymkwi/linux/ path. At 19:16 UTC i removed the separation of log files between IPs flagged as bots, and IPs not flagged as bots.

The three measures i progressively put in place later were: web caching (2025-11-17), manually sending IPs to a garbage generator with a rate-limit (Iocaine 2) (2025-11-14, 15 and 18), and then Iocaine 3 (2025-11-19).

Common Logs	Successful	Delayed (429)	Error (5XX)	Measures in place
2025-11-14	275323	66517	0	Iocaine 2.1 + Rate-limiting
2025-11-15	71712	54259	9802	Iocaine 2.1 + Rate-limiting
2025-11-16	140713	0	65763	None
2025-11-17	514309	25986	3012	Caching, eventually rate-limiting²
2025-11-18	335266	20280	1	Iocaine 2.1 + Rate-limiting
2025-11-19	3183	0	0	Iocaine 3
Bot Logs	Successful	Delayed (429)	Error (5XX)	Measures in place
2025-11-14 (bots)	41388	65517	0	Iocaine 2.1 + Rate-limiting
2025-11-15 (bots)	34190	53403	63	Iocaine 2.1 + Rate-limiting
2025-11-16 (bots)	-	-	-	(no bot-specific logs)
2025-11-17 (bots)	-	-	-	(no bot-specific logs)
2025-11-18 (bots)	390013	0	13	Iocaine 2.1 + Rate-limiting
2025-11-19 (bots)	731593	0	0	Iocaine 3

Table 1: Number of Queries Per Day

(Commands used to generate Table 1)

Assuming your log file is git-access-2025-11-14.log.gz:

zcat git-access-2025-11-14.log.gz | grep '" 200 ' | wc -l
zcat git-access-2025-11-14.log.gz | grep '" 429 ' | wc -l
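The two commands above cover the 200 and 429 columns. A single awk pass can tally all three columns at once. This is a sketch under assumptions: count_statuses is a made-up name, and it expects the status code in the 9th whitespace-separated field, which holds for well-formed request lines in nginx's "combined" log format.

```shell
# Tally successful (200), delayed (429) and backend-error (5xx) responses
# from access-log lines read on stdin.
count_statuses() {
    awk '
        $9 == 200              { ok++ }
        $9 == 429              { delayed++ }
        $9 ~ /^5[0-9][0-9]$/   { errors++ }
        END { printf "200=%d 429=%d 5xx=%d\n", ok, delayed, errors }
    '
}

# Usage:
#   zcat git-access-2025-11-14.log.gz | count_statuses
```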

Without spoiling too much, caching was an utter failure, and the improvement i measured by manually rate-limiting a set of IPs (from Huawei Cloud and Alibaba) on the Linux repository only helped so much. When all protections dropped, my server became so unresponsive that backend errors (usually timeouts) spiked. Errors also happened with caching, when nginx encountered an issue while buffering a reply. Overall, caching encouraged more queries.

Once Iocaine was deployed, the vast majority of queries were routed away from the backend, with no errors reported, and no delaying because all of the IPs i manually rate-limited were caught by Iocaine instead.

Out of all these queries, 117.64.70.34 is the most common source of requests, with 226023 total queries originating from the ChinaNet-Backbone ASN (AS4134). It is followed by 136.243.228.193 (13849 queries), an IP from Hetzner whose hostname ironically resolves to crawling-gateway-136-243-228-193.dataforseo.com. Then come 172.17.0.3, the uptime prober of VC Status, with 6908 queries, and 74.7.227.127, an IP from Microsoft’s AS 8075 (6117 queries).

Day	Unique IP Count
2025-11-14	16461
2025-11-15	18639
2025-11-16	41712
2025-11-17	47252
2025-11-18	22480
2025-11-19	14230

Table 2: Grand Total of Unique IPs Querying the Forge

(Commands used to generate Table 2)

Assuming your log files are called *git-access-2025-11-14.log.gz:

zcat *git-access-2025-11-14.log.gz | awk '{ print $1 }' | sort -u | wc -l

On the two days where restrictions were lifted or there was only caching, the amount of unique IPs querying the forge doubled. The more you facilitate the work of these crawlers, the more they are going to pound you. They will always try and get more out of your server than you are capable of providing.

Day	Top 1	Top 2	Top 3	Top 4	Top 5
2025-11-14	(226089) - `/reibooru/reibooru`	(40189) - `/Lymkwi/linux`	(1454) - `/`	(1405) - `/rail`	(1174) - `/Soblow/indi-hugo`
2025-11-15	(35163) - `/Lymkwi/linux`	(18952) - `/vc-archival/youtube-dl`	(4197) - `/vc-archival/youtube-dl-original`	(1655) - `/reibooru/reibooru`	(1635) - `/Lymkwi/gr-gsm`
2025-11-14 (bots)	(40189) - `/Lymkwi/linux`	(270) - `/oror/necro`	(79) - `/Lymkwi/[REDACTED]`³	(55) - `/vc-archival/youtube-dl`	(52) - `/oror/asm`
2025-11-15 (bots)	(32895) - `/Lymkwi/linux`	(260) - `/oror/necro`	(193) - `/Lymkwi/gr-gsm`	(95) - `/Lymkwi/[REDACTED]`³	(48) - `/alopexlemoni/GenderDysphoria.fyi`
2025-11-16	(72687) - `/vc-archival/youtube-dl`	(23028) - `/Lymkwi/linux`	(16779) - `/vc-archival/youtube-dl-original`	(5390) - `/reibooru/reibooru`	(3585) - `/Lymkwi/gr-gsm`
2025-11-17	(361632) - `/vc-archival/youtube-dl`	(74048) - `/vc-archival/youtube-dl-original`	(18136) - `/reibooru/reibooru`	(13147) - `/oror/necro`	(12921) - `/alopexlemoni/GenderDysphoria.fyi`
2025-11-18	(227019) - `/vc-archival/youtube-dl`	(46004) - `/vc-archival/youtube-dl-original`	(12644) - `/alopexlemoni/GenderDysphoria.fyi`	(12624) - `/reibooru/reibooru`	(7712) - `/oror/necro`
2025-11-18 (bots)	(261346) - `/vc-archival/youtube-dl`	(43923) - `/vc-archival/youtube-dl-original`	(20195) - `/alopexlemoni/GenderDysphoria.fyi`	(18808) - `/reibooru/reibooru`	(10134) - `/oror/necro`
2025-11-19	(1418) - `/`	(1248) - `/rail`	(356) - `/Soblow`	(31) - `/assets/img`	(25) - `/Soblow/IndigoDen`
2025-11-19 (bots)	(448626) - `/vc-archival/youtube-dl`	(73164) - `/vc-archival/youtube-dl-original`	(39107) - `/reibooru/reibooru`	(37107) - `/alopexlemoni/GenderDysphoria.fyi`	(25921) - `/vc-archival/YSLua`

Table 3: Top 5 Successful Repo/Account/Page Hits Per Day

(Commands used to generate Table 3)

Assuming you want data for the log file called git-access-2025-11-14.log.gz:

 zcat git-access-2025-11-14.log.gz | grep '" 200 ' | awk '{ print $7 }' \
    | cut -d/ -f -3 | sort | uniq -c | sort -n \
    | tail -n 5 | tac

Big repositories with a lot of commits and a lot of files are a bountiful resource for the crawlers. Once they enter those, they will take ages to leave, at least because of the sheer amount of pages that can be generated by following the links of a repository.

Most legitimate traffic seems to be either fetching profiles (a couple of my users have their profiles listed in their fediverse bios) or the root page of my forge.

	2025-11-14 (all)	2025-11-15 (all)	2025-11-16 (all)
Top 1	(8532) - AS136907 (Huawei Clouds)	(8537) - AS136907 (Huawei Clouds)	(8535) - AS136907 (Huawei Clouds)
Top 2	(2142) - AS45899 (VNPT Corp)	(2107) - AS45899 (VNPT Corp)	(4002) - AS212238 (Datacamp Limited)
Top 3	(803) - AS153671 (Liasail Global Hongkong Limited)	(895) - AS153671 (Liasail Global Hongkong Limited)	(3504) - AS9009 (M247 Europe SRL)
Top 4	(555) - AS5065 (Bunny Communications)	(765) - AS45102 (Alibaba US Technology Co., Ltd.)	(3206) - AS3257 (GTT Communications)
Top 5	(390) - AS21859 (Zenlayer Inc)	(629) - AS5065 (Bunny Communications)	(2874) - AS45899 (VNPT Corp)

Table 4: Top ASN Per Day For The First Three Days, Per Unique IP Count

(Commands used to generate Table 4)

For this, i needed a database of IP-to-ASN data. i got one from IPInfo by registering for a free account and using their web API. i first scripted a mapping of unique IP addresses to AS number. For example, for the log file bot-git-access-2025-11-18.log.gz:

while read -r ip; do
    # Quote the URL so the shell does not interpret the ? or the <> placeholder
    ASN=$(curl -qfL "https://api.ipinfo.io/lite/${ip}?token=<my token>" | jq -r .asn)
    printf '%s %s\n' "$ip" "$ASN" | tee -a 2025-11-18-bot.ips.txt
done < <(zcat bot-git-access-2025-11-18.log.gz | awk '{ print $1 }' | sort -u)

Then, with this map, i run:

cat 2025-11-18-bot.ips.txt | cut -d' ' -f 2 | sort | uniq -c | sort -n | tail -n 5

So my largest hits are from Huawei Clouds (VPS provider), VNPT (a Vietnamese mobile and home ISP), Liasail Global HK Limited (a VPS/“AI-powering service” provider), Bunny Communications LLC (a broadband ISP for residential users), and Zenlayer (CDN/Cloud infrastructure provider). When i lifted all protections, Datacamp Limited (a VPS provider), GTT Communications (some sort of bullshit-looking ISP⁴ who, i have been informed, is in fact a backbone operator), and M247 Europe SRL (a hosting provider) suddenly appeared. If memory serves me right, Datacamp, GTT and M247 were also companies i had flagged during my initial investigation in summer 2024, and added to the manually blocked/limited IPs alongside all of Huawei Cloud and Alibaba.

Interestingly, both Liasail and Zenlayer mention that they “Power AI” on their front page. They sure do. Worryingly, VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.

# The Protection Measures

i have one goal, and one constraint. My goal is that i need to protect the forge as much as possible, by means of either blocking bots or offloading the cost to my VPS provider (whose electricity i do not pay for). My only constraint: i was not going to deploy a proof-of-work-based captcha system such as Anubis. There are two reasons for this constraint:

1. i personally find that forcing your visitors to expend more computational power to prove they’re not a scraper is bad praxis. There are devices out there that legitimately want that access, but have limited computational power or features. And, yeah, there are multiple types of challenges, some of which take low-power devices into account or even those that cannot run JavaScript, but,
2. Scrapers can easily bypass Anubis. It’s not a design flaw. Anubis is harm reduction, not pixie dust.

i tried layers of solutions:

caching on the reverse proxy
Iocaine 2 with no classifiers, which generates garbage in reply to any query you send it
Manually redirecting IPs and rate-limiting them
Deploying Iocaine 3, with its classifiers (Nam-Shub-of-Enki)

## Reverse-Proxy Caching

i have a confession to make: i never realized that nginx did not cache anything by default. That realization promptly came with the other realization that caching things correctly is hard. i may, some day, write about my experience of protecting a service that posted links to itself on the fediverse, so that it wouldn’t slow to a crawl for ten minutes after every post.

As for the rest of these, i will be showing my solution in nginx. You can, almost certainly, figure out a way of doing exactly the same thing with any other decent reverse proxy software.

To create a cache for my forge, i add the following line to the http {} block of /etc/nginx/nginx.conf:

proxy_cache_path /var/cache/nginx/forge/ levels=1:2 keys_zone=forgecache:100m;

That will create a 2-level cache called forgecache that will hold ~~100MB of data~~ 100MB of in-memory index data for a cache located at /var/cache/nginx/forge. i create the directory and make www-data its owner and group.

In /etc/nginx/sites-enabled/vcinfo-git.conf, where my git forge’s site configuration sits, i have a location block that serves the whole root of the service, which i modify thusly:

location / {
    proxy_cache forgecache;
    proxy_buffering on;
    proxy_cache_valid any 1h;
    add_header X-Cached $upstream_cache_status;
    expires 1h;
    proxy_ignore_headers "Set-Cookie";
    proxy_hide_header "Set-Cookie";

    # more stuff...
}

That configuration does several things: it turns on caching and buffering at the proxy (proxy_buffering), telling it to use forgecache (proxy_cache) and keep any page valid for an hour (proxy_cache_valid). It also adds a response header, X-Cached, that will let you debug whether or not a query hit or missed the cache (add_header). The expires directive adds headers telling your visitor’s browser that the content they cache will also expire in an hour (expires). Finally, the cache ignores any response header that sets a cookie (proxy_ignore_headers, proxy_hide_header), to attempt to remove any page that could be customized for a user once they log in.
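To see what the X-Cached header reports, the response headers can be dumped with curl and filtered. A minimal sketch: cache_status is a made-up helper name and the URL in the usage comment is a placeholder. With the configuration above, a first request to a page would be expected to report MISS and a repeat within the hour HIT.

```shell
# Print the value of the X-Cached response header from HTTP headers on stdin.
# Prints nothing if the header is absent.
cache_status() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "x-cached" { print $2 }'
}

# Usage (placeholder URL): fetch the page, discard the body, dump the headers
#   curl -s -o /dev/null -D - https://git.example.org/some/page | cache_status
```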

EDIT 2026-03-19

As it turns out, proxy_ignore_headers on Set-Cookie is a catastrophic idea. Digging into the documentation, you can find out after two or three indirections that, by default, caching is disabled if the Set-Cookie header is present. By ignoring it, i essentially told nginx, “hey, please, cache my logged in pages!!”. The correct way to do what i want here is to remove proxy_ignore_headers and proxy_hide_header (it’s useless on Set-Cookie by default - again, nginx will not cache those pages). The result with that configuration is the same however: caching pages when bots only access each URL once is useless. Also, you do not need 100m of cache index data. Holy fuck. That is way too much. 5 Megabytes is already enough.

The result? Caching was a disaster, predictably so. Caching works when the same resource is repeatedly queried, like with page assets, JavaScript, style sheets, etc. In this case, the thousands of actors querying my forge are coordinated, somehow, never (or rarely) query the same resource twice, and only download the raw HTML of the web pages.

Worse, caching messed up the display of authenticated pages. The snippets above are not enough to delineate between an authenticated session and an unauthenticated one, and it broke my forge so badly that i had to disable caching and enable the next layer early on 2025-11-17, or i just could not use my forge.

## Rate-Limiting on the Proxy

The next layer of protection simply consisted in enabling a global rate-limit on the most-hit repositories:

limit_req_zone wholeforge zone=wholeforge:10m rate=3r/s;

server {
    # ...
	# A regex location (~) is needed for the alternation; ^~ only matches a literal prefix
	location ~ ^(/alopexlemoni/GenderDysphoria\.fyi|/oror/necro|/Lymkwi/linux|/vc-archival/youtube-dl-original|/reibooru/reibooru) {
		proxy_set_header Host $host;
		proxy_set_header X-Real-IP $remote_addr;
		proxy_max_temp_file_size 2048m;

		limit_req zone=wholeforge nodelay;

		# No URI part after the upstream: nginx rejects one inside a regex location
		proxy_pass http://<my actual upstream>;
	}
}

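This was achieved in two directives. The first one, limit_req_zone, sits outside the server {} block and defines a zone called wholeforge that stores 10MB of state data and limits to 3 requests per second. Because its key is the constant string wholeforge rather than a per-client variable, that budget is shared by every visitor. The second one, limit_req, sits inside the location block and applies the zone to every request for the listed repositories.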

When this was in place, however, actually accessing the Linux repository as a normal user (or any of the often-hit repositories) became a nightmare of waiting and request timeouts.

## Manually Redirecting to a Garbage Generator

Because caching was (predictably) useless, and rate-limiting was hindering me as well, i re-enabled the initial setup that was in place before my experiments: manually redirecting queries to a garbage generator (in this case, an old version of Iocaine). It’s largely based on my initial setup following this tutorial in French.

For the purpose of this part, you do not have to know what Iocaine does precisely. In the next section, i will present my current and final setup, with an updated Iocaine that also includes a classifier to decide which queries are bots and which are regular users. For now, i will present the version where i manually chose who to return garbage to based on IP addresses.

As a little bonus, it will also include rate-limiting of those garbage-hungry bots.

i add a file called /etc/nginx/snippets/block_bots.conf which contains:

if ($bot_user_agent) {
    rewrite ^ /deflagration$request_uri;
}
if ($bot_ip) {
    rewrite ^ /deflagration$request_uri;
}
location /deflagration {
    limit_req zone=bots nodelay;
    proxy_set_header Host $host;
    proxy_pass <garbage upstream>;
}

This will force any query categorized as bot_user_agent or bot_ip to be routed through to a different upstream which serves garbage. That upstream is also protected by rate-limiting on a zone called bots which is defined in the next bit of code. This snippet is actually meant to be included in your server {} block using the include directive.

i then add the following in /etc/nginx/conf.d/bots.conf:

map $http_user_agent $bot_user_agent {
    default 0;

    # from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
    ~*amazonbot 1;
    ~*anthropic-ai  1;
    ~*applebot  1;
    ~*applebot-extended 1;
    ~*brightbot 1;
    ~*bytespider  1;
    ~*ccbot 1;
    ~*chatgpt-user  1;
    ~*claude-web  1;
    ~*claudebot 1;
    ~*cohere-ai 1;
    ~*cohere-training-data-crawler  1;
    ~*crawlspace  1;
    ~*diffbot 1;
    ~*duckassistbot 1;
    ~*facebookbot 1;
    ~*friendlycrawler 1;
    ~*google-extended 1;
    ~*googleother 1;
    ~*googleother-image 1;
    ~*googleother-video 1;
    ~*gptbot  1;
    ~*iaskspider  1;
    ~*icc-crawler 1;
    ~*imagesiftbot  1;
    ~*img2dataset 1;
    ~*isscyberriskcrawler 1;
    ~*kangaroo  1;
    ~*meta-externalagent  1;
    ~*meta-externalfetcher  1;
    ~*oai-searchbot 1;
    ~*omgili  1;
    ~*omgilibot 1;
    ~*pangubot  1;
    ~*perplexitybot 1;
    ~*petalbot  1;
    ~*scrapy  1;
    ~*semrushbot-ocob 1;
    ~*semrushbot-swa  1;
    ~*sidetrade 1;
    ~*timpibot  1;
    ~*velenpublicwebcrawler 1;
    ~*webzio-extended 1;
    ~*youbot  1;

    # Add whatever other pattern you want down here
}

geo $bot_ip {
    default 0;

    # Add your IP ranges here
}

# Rate-limiting setup for bots
limit_req_zone bots zone=bots:30m rate=1r/s;

# Return 429 (Too Many Requests) to slow them down
limit_req_status 429;

That bit of configuration does a mapping between the client IP and a variable called bot_ip, and the client’s user agent and a variable called bot_user_agent. When a known pattern listed in those blocks is found, the corresponding variable is flipped to the provided value (here, 1). Otherwise, it stays 0. Then, we define the rate-limiting zone that is used to slow down the bots so they don’t feed on slop too fast. You will then need to install the http-geoip2 nginx module (on Debian-based distributions, something like apt install libnginx-mod-http-geoip2 will do).

Once that is done, add the following line to the server block of every site you want to protect:

include /etc/nginx/snippets/block_bots.conf;

And when you feel confident enough, roll a nginx -t and reload the unit for nginx.
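To sanity-check which user agents the map would flag, the case-insensitive matching done by the ~* entries can be imitated in the shell. This is only an approximation of what nginx does: is_bot_ua is a made-up helper, it reproduces a short excerpt of the list, and both user-agent strings below are made-up examples.

```shell
# Succeed if the user agent (first argument) contains one of the patterns,
# ignoring case, similar to how the ~* entries of the map match.
# Only a few patterns from the list are reproduced here.
is_bot_ua() {
    printf '%s\n' "$1" | grep -qiE 'amazonbot|bytespider|claudebot|gptbot|petalbot'
}

is_bot_ua 'Mozilla/5.0 (compatible; GPTBot/1.2)' && echo 'bot'
is_bot_ua 'Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0' || echo 'not a bot'
```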

Now, if you’re using caddy or any other reverse proxy, there are probably similar mechanisms available. You can go and peruse the documentation of Iocaine, or look online for specific tutorials that, i am sure, other people have made better than i would.

Immediately after enabling it, and shoving all the IPs from Alibaba Cloud and Huawei Cloud in the bot config file, the activity slowed down on my server. Power usage went down to ~180W, CPU usage to roughly 60%, and it stopped making a hellish noise.

As the stats showed earlier, however, a lot of traffic was still hitting the server itself. Even weirder, there were still occasional spikes, every 3 hours, that lasted about one and a half hours, where the server would whirr and forgejo suffocate again.

Bots were still hitting my server, and there was no clear source for it.

## Automatically Classifying Bots and Poisoning Them: Iocaine and Nam-Shub-of-Enki

The steps i showed so far help when a single IP is hammering at your forge, or when someone is clearly scraping you from an Autonomous System that you do not mind blocking. Sadly, as i’ve shown above in Table 4, a surprising amount of scraping comes from broadband addresses. i can assemble lists of IPs as big as i want, or block entire ASNs, but i would love to have a per-query way of determining if a query looks legitimate.

The next steps of protection will rely on categorizing a source IP based on the credibility of its user agent. This mechanism is largely based on the documentation for Iocaine 3.x. We finally get to talk about Iocaine!

Iocaine is a tool that traps scrapers in a maze of meaningless pages that endlessly lead to more meaningless pages. The content of these pages is generated using a Markov chain, based on a corpus of texts given to the software. Iocaine (specifically all versions after 3 at least⁵) is a middleware, in the sense that it works by being placed on the line between your reverse proxy and the service. Your reverse proxy will first begin by redirecting traffic to Iocaine, and, if Iocaine deems a query legitimate, it will return a 421 Misdirected Request back at your reverse-proxy. The latter must then catch it, and use the real upstream as a fallback. If Iocaine’s Nam-Shub-of-Enki⁶ decides a query came from a bogus or otherwise undesirable source, it will happily reply 200 OK and send generated garbage.

My setup lodges Iocaine 3 between nginx and my forge, following the Iocaine documentation to use the container version. i recommend you follow it, and then add the next little things to enable categorization statistics, and prevent the logging they’re based on from blowing up your storage:

In etc/config.d/03-nam-shub-of-enki.kdl, change the logging block to:

logging {
    enable #true
    classification {
        enable #true
    }
}

In docker-compose.yaml, add the following bits to limit classification logging to 50MB:

services:
  iocaine:
    # The things you already have here...
    # ...
    environment:
      - RUST_LOG=iocaine=info
    logging:
      driver: "json-file"
      options:
        max-size: "50m"

My checks block in Nam-Shub-of-Enki is as such:

checks {
    disable cgi-bin-trap

    asn {
        database-path "/data/ipinfo_lite.mmdb"
        asns "45102" "136907"
    }
    ai-robots-txt {
        path "/data/ai.robots.txt-robots.json"
    }
    generated-urls {
        identifiers "deflagration"
    }
    big-tech {
        enable #true
    }
    commercial_scrapers {
        enable #true
    }
}

i snatched a copy of the latest ipinfo ASN database for free and blocked AS45102 (Alibaba) and AS136907 (Huawei Clouds).

On 2025-11-18 at 00:00:29 UTC+1, i enabled Iocaine with the Nam-Shub-of-Enki classifier in front of my whole forge. Immediately, my server was no longer hammered. Power draw went down to just above 160W.

One problem i noticed however, while trying to deploy the artifact for this blog post on my forge, is that Iocaine causes issues when huge PUT/PATCH/POST requests with large bodies are piped through it: it will hang up before the objects are entirely written. i am trying to figure out a way of only redirecting HEAD and GET requests to Iocaine in nginx, as is done in the Caddy example of the Iocaine documentation.

What i ended up settling on requires a bit of variable mapping. At the start of your site configuration, before the server {} block:

map $request_method $upstream_location {
	GET	<iocaine upstream>;
	HEAD	<iocaine upstream>;
	default	<your actual upstream>;
}

map $request_method $upstream_log {
	GET	bot_access;
	HEAD	bot_access;
	default	access;
}

Then, in the block that does the default location, write:

	location / {
	    proxy_cache off;
	    access_log /var/log/nginx/$upstream_log.log combined;
	    proxy_intercept_errors on;
	    error_page 421 = @fallback;
	    proxy_set_header Host $host;
	    proxy_set_header X-Real-IP $remote_addr;
	    proxy_pass http://$upstream_location;
	}

That is, replace the upstream in proxy_pass with the upstream decided by the variable mapping, and, while we’re at it, use $upstream_log to know which log will be the final one for that request. The @fallback named location referenced by error_page is not shown here; it is the one that hands the request over to your actual upstream when Iocaine answers 421. i differentiate between bot_access.log and access.log to gather my statistics, so the difference matters to me. Change the variables to suit the way you do it (or remove it, if you don’t distinguish clients in your log files).

# Monitoring Iocaine

Currently, on 2025-11-30 at 16:33:00 UTC+1, Iocaine has served 38.16GB of garbage. Over the past hour, 152.11MB of such data was thrown back at undesirable visitors. 3.39GB over the past day, 22.22GB over the past week. You can get the snippet that describes my Iocaine-specific Grafana views here.

The vast majority of undesirable queries come from Claude, OpenAI, and Disguised Bots. Claude and OpenAI are absolutely gluttonous, and, once they have access to a ton of pages, they will greedily flock to fetch them like pigeons being fed breadcrumbs laced with strychnine.

(Figure: Hits by Ruleset on my Grafana)

AI bot scrapers (ai.robots.txt) maintain a constant 920~930 queries per minute (15-ish QPS) over the 6 domains i have protected with Iocaine, including the forge.

There is also a low hum of a mix of commercial scrapers (~1 request every two seconds), big tech crawlers (Facebook, Google, etc, about 2QPS or 110 queries/min), and, especially, fake browsers.

Classifying fake browsers is where Iocaine really shines, specifically thanks to the classifiers implemented via Nam-Shub-of-Enki. The faked bots classifier detects the likelihood that the user agent reported by the client is bullshit, generated from a list of technologies mashed together. For example, if your client reports a user agent for a set of software that never supported HTTP2, or never actually existed together, or is not even released yet, it will get flagged. Think, for example, Windows NT 4 running Chrome, pretending to be able to do TLS1.3.

The background-noise level of such queries is usually 140~160 queries per minute (or 2~3 QPS). However, notice those spikes in the graph above?

## The Salvos of Queries

For a while during my experiments i noticed those pillars of queries. My general nginx statistics would show a sharp increase of connections, with an initial ramp-up, and a stable-ish plateau lasting about one and a half hours, before suddenly stopping. It would then repeat again, roughly three hours later.

Between October 29th and November 19th, and on November 28th, these spikes would constantly show up. As soon as i got Iocaine statistics running, it would flag all of those queries as faked browsers.

i investigated those spikes in particular, because they baffled and scared me: the regularity with which they probed me, and the sharpness of the ramp-up and halts, made me afraid that someone, somewhere, was organizing thousands of IPs to specifically take turns at probing websites. i have not reached any solid conclusions, beyond the following:

- The initial phase of an attack wave begins with a clear exponential ramp-up
- The ramp-up stops when the server starts throwing errors, or when the response latency reaches a given threshold
- Every wave of attack lasts roughly one hour and a half
- An individual IP will often contribute no more than one query, though some IPs send as many as 50 to 60
- The same 15 or so ASNs keep showing up, with five regular leaders in IP count:
  1. AS212238: Datacamp Limited
  2. AS3257: GTT Communications
  3. AS9009: M247 Europe SRL
  4. AS203020: HostRoyale Technologies Pvt Ltd
  5. AS210906: UAB “Bite Lietuva” (a Lithuanian ISP)

All of those are service providers. My working theory at the moment is that someone registered thousands of cheap servers with many different companies, and is selling access to them as web proxies for scraping and scanning. i will probably write something up later when i have properly investigated that specific phenomenon.
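The per-IP and per-ASN figures above boil down to two counts over the queries of a wave. A minimal Python sketch, assuming each query has already been annotated with its ASN (the lookup itself is out of scope here, and `summarize_wave` is a hypothetical name):

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def summarize_wave(
    hits: Iterable[Tuple[str, str]],
) -> Tuple[Counter, List[Tuple[str, int]]]:
    """Summarize one wave from (ip, asn) pairs, one pair per query.

    Returns the number of queries per IP, and the ASNs ranked by how many
    distinct IPs they contributed (ties broken by ASN name).
    """
    queries_per_ip: Counter = Counter()
    ips_per_asn = defaultdict(set)
    for ip, asn in hits:
        queries_per_ip[ip] += 1
        ips_per_asn[asn].add(ip)
    ranking = sorted(
        ((asn, len(ips)) for asn, ips in ips_per_asn.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return queries_per_ip, ranking
```

Ranking by distinct IPs rather than by query count matches the observation that most IPs only send a single query.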

# Conclusion

Self-hosting anything that is deemed “content” openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell.

This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see.

i have to learn to protect myself from industrial actors in order to put anything online, because anything a person makes is valuable, and that value will be sucked dry by every tech giant to be emulsified, liquified, strained, and ultimately inexorably joined in an unholy mesh of learning weights.

This experience has rather profoundly radicalized the way i think about technology. Sanitized content can be chewed on and shat out by companies for training, but their AI tools will never swear. They will never use a slur. They will never have a revolutionary thought. Despite being an amalgamation of shit rolled up in the guts of the dying capitalist society, they are sanitized to hell and beyond.

The developer of Iocaine put it best when explaining why Iocaine has absolutely unhinged identifiers (such as SexDungeon, PipeBomb, etc): they will all trigger “safeguard” mechanisms in commercial AI tools. Absolutely no coding agent will accept analyzing and explaining code where the memory allocator’s free function is called liberate_palestine. i bet that if i described, in graphic detail, in the comments of this page, the different ways being a furry intersects with my sexuality, no commercial scraper would even dare ingest this page.

Fuck tech companies. Fuck “AI”. Fuck the corporate web.

i could’ve put QoS in place to limit that traffic; but that would only have been a bandaid solution, and caused massive congestion because of traffic shaping anyways. ↩
Caching alone proved so inefficient that i had to enable the next layer before the day ended, because the server was on its knees and nothing visibly improved with caching alone. ↩
“Whoops! i did not know that was still public. Yikes! That’s doxxing me!”, i said, while gathering those stats. ↩ ↩2
Like, seriously. i spent five minutes on their home page trying to figure out if they were a tier-2 ISP, a datacenter infrastructure provider, an IT management company, a hosting company, and i have no idea. Their home page is filled with impenetrable corporate jargon. If you want to know what they do, you have to look them up elsewhere. ↩
Initially, until 2025-11-15, i was using Iocaine 2.1, which required that you manually redirect traffic to it. The classifiers and request handlers that required Iocaine to be in the line of traffic were added in later versions. ↩
See, i don’t really understand to this day if Nam-Shub-of-Enki is a classifier, a request handler, or a set of classifying rules. To me, it’s a classifier framework, but i am not sure. Iocaine is great, but its documentation is a bit all over the place at the moment (and commits a couple of cardinal sins of documentation writing, like, for example, mixing documentation for developers with documentation for the users). ↩