In August 2024, one of my roommates and partners messaged the apartment group chat, saying she noticed the internet was slow again at our place, and my forgejo was unable to render any page in under 15 seconds.
i investigated, thinking it would be a trivial little problem to solve. Soon enough, however, i would uncover hundreds of thousands of queries a day from thousands of individual IPs, fetching seemingly-random pages in my forge every single day, all the time.
This post summarizes the practical issues that arose as a result of the onslaught of scrapers eager to download millions of commits off of my forge, and the measures i put in place to limit the damage.
# Why the forge?
In the year 2025, on the web, everything is worth being scraped. Everything that came out of the mind of a human is susceptible to be snatched under the vastest labor theft scheme in the history of mankind. This very article, the second it gets published in any indexable page, will be added to countless datasets meant to train foundational large-language models. My words, your words, have contributed infinitesimal shifts of neural-network weights underpinning the largest, most grotesque accumulation of wealth seen over the lifetime of my parents, grandparents, and their grandparents themselves.
Oh, and forges have a lot of commits. See, if you have a repository that
is publicly exposed, every file in every folder for every commit gets its own linked page.
Add other options, such as a git blame on a file, and multiply it by the
number of files and commits. Add the raw download link, also multiplied by the
number of commits.
Say, hypothetically, you have a linux repository available, and only with
all the commits in the master branch up to the v6.17 tag from 2025-09-18.
That’s 1,383,738 commits in the range 1da177e4c3f4..e5f0a698b34e. How many
files is that? Well:
count=0;
while read -r rev; do
point=$(git ls-tree -tr $rev | wc -l);
count=$(( $count + $point ));
printf "[%s] %s: %d (tot: %d)\n" $(git log -1 --pretty=tformat:%cs $rev) $rev $point $count;
done < <(git rev-list "1da177e4c3f4..e5f0a698b34e");
printf "Total: $count\n";
i ran this on the 100 commits before v6.17. With git ls-tree -tr $rev, both files and directories are counted; replace it with git ls-tree -r $rev to count only files. i got 72024729 files, and 76798658 files and
directories. Running on the whole history of Linux’s master branch yields
78,483,866,182 files, and 83,627,462,277 files and directories.
Now, for a ballpark estimate of the number of pages that can be scraped if you have a copy of Linux, apply the formula:
(Ncommits * Nfiles) * 2 + (Ncommits * Nfilesandfolders) * 2 + Ncommits * 3
That is, applied to my hypothetical Linux repository:
78483866182 * 2 + 83627462277 * 2 + 1383738 * 3 = 324,226,808,132 pages
The first *2 accounts for the fact that every file of every commit can be scraped
raw, and git-blame’d. The second part of the
formula considers every single file or folder page. The third part accounts for
the fact that every file of every commit can be diffed with its version of
every commit (in theory). The final component considers every commit summary
page.
That gives, for me, 324 billion 226 million 808 thousand and 132 pages that can
be scraped. From a single repository. Assume that every scraper agent that
enters one of these repositories will also take note of every other link on the
page, and report it so that other agents can scrape them. These scrapers
effectively act like early 2000s web spiders that crawled the internet to index
it, except they do not care about robots.txt, and they will absolutely keep
scraping new links again and again with no strategy to minimize the cost on
you, as a host.
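As a sanity check, the ballpark formula above can be replayed in shell arithmetic. This is only a sketch: the function name page_estimate is made up for illustration, it takes the totals already summed by the counting loop (files, files and directories, commits) rather than recomputing them, and it assumes a shell with 64-bit integer arithmetic.

```shell
# Hypothetical helper: replays the ballpark page-count formula.
# $1 = files, summed over every commit
# $2 = files and directories, summed over every commit
# $3 = number of commits
page_estimate() {
    echo $(( $1 * 2 + $2 * 2 + $3 * 3 ))
}

# Totals quoted in this section for Linux's master branch up to v6.17
page_estimate 78483866182 83627462277 1383738
```

With the totals quoted in this section, it prints 324226808132, which matches the figure above.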
# The Cost of Scraping
As i am writing the original draft of this section, the longer-term measures i put in place have been removed, so i could gather up-to-date numbers to display how bad the situation is.
i pay for the electricity that powers my git forge. Okay, actually, one of my roommates does, but we put it on the calc sheet where we keep track of who pays what (when we remember).
At the time i began fighting scrapers, my git forge ran from an old desktop computer plugged in my living room. Now, it is in our home’s rackable server in a virtual machine. i never got to measure differences in power consumption when we got scraped or not scraped on the desktop machine, but i did on the rackable server. If memory serves me right, stopping the wave of scrapers reduced the power draw of the server from ~170W to ~150W.
Right now, with all the hard drives in that server spinning, and every protection off, we are drawing 200W from the power grid on that server. Constantly. By the end of this experiment, me and my roommates will have computed that the difference in power usage caused by scraping costs us ~60 euros a year.
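For readers who want to redo that kind of estimate for their own setup, here is a rough sketch of the arithmetic. The function name yearly_cost, the 30W difference and the 0.25 euro per kWh price are all hypothetical placeholders, not measurements from this post; plug in your own numbers.

```shell
# Hypothetical helper: yearly cost of a constant extra power draw.
# $1 = extra draw in watts, $2 = electricity price per kWh
yearly_cost() {
    # watts * hours in a year / 1000 = kWh, then multiplied by the price
    LC_ALL=C awk -v watts="$1" -v price="$2" \
        'BEGIN { printf "%.2f\n", watts * 24 * 365 / 1000 * price }'
}

# Example with made-up values: 30W of extra draw at 0.25 per kWh
yearly_cost 30 0.25
```

A constant extra draw of a few tens of watts lands in the same order of magnitude as the yearly figure quoted above, which is the only point of the exercise.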
Another tied cost is that the VM that runs the forge is figuratively suffocating
from the number of queries. Not all queries are born equal, either: requests to
see the blame of a file or a diff between commits incur a higher cost than
just rendering the front page of a repository. The last huge wave of scraping
left my VM at 99+% usage of 4 CPU cores and 2.5GiB of RAM,
whereas the usual levels i observe are closer to 4% usage of CPUs, and an oscillation
between 1.5GiB and 2GiB of RAM.
As i’m writing this, the VM running forgejo eats 100% of 8 CPU cores.
Additionally, the networking cost is palpable. Various monitoring tools let me see the real-time traffic statistics in our apartment. Before i put the first measures in place to thwart scraping, we could visibly see the traffic coming out of the desktop computer running my forge and out to the internet. My roommates’ complaints that it slowed down the whole internet here were in fact founded: when we had multiple people watching live streams or doing pretty big downloads, they were throttled by the traffic out of the forge.1
The egress data rate of my forge’s VM is at least 4MBps of data (32Mbps). Constantly.
Finally, the human cost: i have spent entire days behind my terminals trying to figure out 1) what the fuck was going on and 2) what the fuck to do about it. i have had conversations with other people who self-host their infrastructure, desperately trying to figure out workable solutions that would not needlessly impact our users. And the funniest detail is: that rackable server is in the living room, directly in front of my bedroom door. It usually purrs like an adorable cat, but, lately, it’s been whirring louder and louder. i can hear it. when i’m trying to sleep.
# Let’s do some statistics.
i was curious to analyze the nginx logs to understand where the traffic came from and what shape it took.
As a case study, we can work on /var/log/nginx/git.vulpinecitrus.info/ from
2025-11-14 to 2025-11-19. Note that on 2025-11-15 at 18:27 UTC, i
stopped the redirection of new agents into the Iocaine crawler maze (see
below). At 19:15 UTC, i removed the nginx request limit zone from the
/Lymkwi/linux/ path. At 19:16 UTC i removed the separation of log files
between IPs flagged as bots, and IPs not flagged as bots.
The three measures i progressively put in place later were: web caching (2025-11-17), manually sending IPs to a garbage generator with a rate-limit (Iocaine 2) (2025-11-14, 15 and 18), and then Iocaine 3 (2025-11-19).
| Common Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
|---|---|---|---|---|
| 2025-11-14 | 275323 | 66517 | 0 | Iocaine 2.1 + Rate-limiting |
| 2025-11-15 | 71712 | 54259 | 9802 | Iocaine 2.1 + Rate-limiting |
| 2025-11-16 | 140713 | 0 | 65763 | None |
| 2025-11-17 | 514309 | 25986 | 3012 | Caching, eventually rate-limiting2 |
| 2025-11-18 | 335266 | 20280 | 1 | Iocaine 2.1 + Rate-limiting |
| 2025-11-19 | 3183 | 0 | 0 | Iocaine 3 |

| Bot Logs | Successful | Delayed (429) | Error (5XX) | Measures in place |
|---|---|---|---|---|
| 2025-11-14 (bots) | 41388 | 65517 | 0 | Iocaine 2.1 + Rate-limiting |
| 2025-11-15 (bots) | 34190 | 53403 | 63 | Iocaine 2.1 + Rate-limiting |
| 2025-11-16 (bots) | - | - | - | (no bot-specific logs) |
| 2025-11-17 (bots) | - | - | - | (no bot-specific logs) |
| 2025-11-18 (bots) | 390013 | 0 | 13 | Iocaine 2.1 + Rate-limiting |
| 2025-11-19 (bots) | 731593 | 0 | 0 | Iocaine 3 |
(Commands used to generate Table 1)
Assuming your log file is git-access-2025-11-14.log.gz:
zcat git-access-2025-11-14.log.gz | grep '" 200 ' | wc -l
zcat git-access-2025-11-14.log.gz | grep '" 429 ' | wc -l
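The two commands above cover the 200 and 429 columns. A single awk pass can produce all three columns at once, including the 5XX one. This is a sketch, not the exact command used for the table: count_statuses is a made-up name, and it assumes nginx's default combined log format, where the status code is the ninth whitespace-separated field.

```shell
# Hypothetical helper: counts 200, 429 and 5xx responses from log lines on stdin.
# Assumes the combined log format (status code in field 9).
count_statuses() {
    awk '$9 == 200            { ok++ }
         $9 == 429            { delayed++ }
         $9 ~ /^5[0-9][0-9]$/ { errors++ }
         END { printf "200=%d 429=%d 5xx=%d\n", ok, delayed, errors }'
}
```

Usage would look like `zcat git-access-2025-11-14.log.gz | count_statuses`.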
Without spoiling too much, caching was an utter failure, and the improvement i measured by manually rate-limiting a set of IPs (from Huawei Cloud and Alibaba) on the Linux repository only helped so much. When all protections dropped, my server became so unresponsive that backend errors (usually timeouts) spiked. Errors also happened with caching, when nginx encountered an issue while buffering a reply. Overall, caching encouraged more queries.
Once Iocaine was deployed, the vast majority of queries were routed away from the backend, with no errors reported, and no delaying because all of the IPs i manually rate-limited were caught by Iocaine instead.
Out of all these queries, 117.64.70.34 is the most common source of requests,
with 226023 total queries originating from the ChinaNet-Backbone ASN (AS4134).
It is followed by 136.243.228.193 (13849 queries), an IP from Hetzner whose
hostname ironically resolves to
crawling-gateway-136-243-228-193.dataforseo.com. Then come 172.17.0.3, the
uptime prober of VC Status, with 6908
queries, and 74.7.227.127, an IP from Microsoft’s AS 8075 (6117 queries).
| Day | Unique IP Count |
|---|---|
| 2025-11-14 | 16461 |
| 2025-11-15 | 18639 |
| 2025-11-16 | 41712 |
| 2025-11-17 | 47252 |
| 2025-11-18 | 22480 |
| 2025-11-19 | 14230 |
(Commands used to generate Table 2)
Assuming your log files are called *git-access-2025-11-14.log.gz:
zcat *git-access-2025-11-14.log.gz | awk '{ print $1 }' | sort | uniq -c | wc -l
On the two days where restrictions were lifted or there was only caching, the number of unique IPs querying the forge more than doubled. The more you facilitate the work of these crawlers, the more they are going to pound you. They will always try and get more out of your server than you are capable of providing.
| Day | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| 2025-11-14 | (226089) - /reibooru/reibooru | (40189) - /Lymkwi/linux | (1454) - / | (1405) - /rail | (1174) - /Soblow/indi-hugo |
| 2025-11-15 | (35163) - /Lymkwi/linux | (18952) - /vc-archival/youtube-dl | (4197) - /vc-archival/youtube-dl-original | (1655) - /reibooru/reibooru | (1635) - /Lymkwi/gr-gsm |
| 2025-11-14 (bots) | (40189) - /Lymkwi/linux | (270) - /oror/necro | (79) - /Lymkwi/[REDACTED]3 | (55) - /vc-archival/youtube-dl | (52) - /oror/asm |
| 2025-11-15 (bots) | (32895) - /Lymkwi/linux | (260) - /oror/necro | (193) - /Lymkwi/gr-gsm | (95) - /Lymkwi/[REDACTED]3 | (48) - /alopexlemoni/GenderDysphoria.fyi |
| 2025-11-16 | (72687) - /vc-archival/youtube-dl | (23028) - /Lymkwi/linux | (16779) - /vc-archival/youtube-dl-original | (5390) - /reibooru/reibooru | (3585) - /Lymkwi/gr-gsm |
| 2025-11-17 | (361632) - /vc-archival/youtube-dl | (74048) - /vc-archival/youtube-dl-original | (18136) - /reibooru/reibooru | (13147) - /oror/necro | (12921) - /alopexlemoni/GenderDysphoria.fyi |
| 2025-11-18 | (227019) - /vc-archival/youtube-dl | (46004) - /vc-archival/youtube-dl-original | (12644) - /alopexlemoni/GenderDysphoria.fyi | (12624) - /reibooru/reibooru | (7712) - /oror/necro |
| 2025-11-18 (bots) | (261346) - /vc-archival/youtube-dl | (43923) - /vc-archival/youtube-dl-original | (20195) - /alopexlemoni/GenderDysphoria.fyi | (18808) - /reibooru/reibooru | (10134) - /oror/necro |
| 2025-11-19 | (1418) - / | (1248) - /rail | (356) - /Soblow | (31) - /assets/img | (25) - /Soblow/IndigoDen |
| 2025-11-19 (bots) | (448626) - /vc-archival/youtube-dl | (73164) - /vc-archival/youtube-dl-original | (39107) - /reibooru/reibooru | (37107) - /alopexlemoni/GenderDysphoria.fyi | (25921) - /vc-archival/YSLua |
(Commands used to generate Table 3)
Assuming you want data for the log file called git-access-2025-11-14.log.gz:
zcat git-access-2025-11-14.log.gz | grep '" 200 ' | awk '{ print $7 }' \
| cut -d/ -f -3 | sort | uniq -c | sort -n \
| tail -n 5 | tac
Big repositories with a lot of commits and a lot of files are a bountiful resource for the crawlers. Once they enter those, they will take ages to leave, at least because of the sheer amount of pages that can be generated by following the links of a repository.
Most legitimate traffic seems to be either fetching profiles (a couple of my users have their profiles listed in their fediverse bios) or the root page of my forge.
| | 2025-11-14 (all) | 2025-11-15 (all) | 2025-11-16 (all) |
|---|---|---|---|
| Top 1 | (8532) - AS136907 (Huawei Clouds) | (8537) - AS136907 (Huawei Clouds) | (8535) - AS136907 (Huawei Clouds) |
| Top 2 | (2142) - AS45899 (VNPT Corp) | (2107) - AS45899 (VNPT Corp) | (4002) - AS212238 (Datacamp Limited) |
| Top 3 | (803) - AS153671 (Liasail Global Hongkong Limited) | (895) - AS153671 (Liasail Global Hongkong Limited) | (3504) - AS9009 (M247 Europe SRL) |
| Top 4 | (555) - AS5065 (Bunny Communications) | (765) - AS45102 (Alibaba US Technology Co., Ltd.) | (3206) - AS3257 (GTT Communications) |
| Top 5 | (390) - AS21859 (Zenlayer Inc) | (629) - AS5065 (Bunny Communications) | (2874) - AS45899 (VNPT Corp) |
(Commands used to generate Table 4)
For this, i needed a database of IP-to-ASN data. i got one from
IPInfo by registering for a free account and using their
web API. i first scripted a mapping of unique IP addresses to AS number. For
example, for the log file bot-git-access-2025-11-18.log.gz:
# Look up the ASN of every unique client IP (one API call per IP)
while read -r ip; do
ASN=$(curl -qfL "api.ipinfo.io/lite/$ip?token=<my token>" | jq -r .asn);
printf '%s %s\n' "$ip" "$ASN" | tee -a 2025-11-18-bot.ips.txt;
done < <(zcat bot-git-access-2025-11-18.log.gz | awk '{ print $1 }' | sort | uniq)
Then, with this map, i run:
cat 2025-11-18-bot.ips.txt | cut -d' ' -f 2 | sort | uniq -c | sort -n | tail -n 5
So my largest hits are from Huawei Clouds (VPS provider), VNPT (a Vietnamese mobile and home ISP), Liasail Global HK Limited (a VPS/“AI-powering service” provider), Bunny Communications LLC (a broadband ISP for residential users), and Zenlayer (CDN/Cloud infrastructure provider). When i lifted all protections, Datacamp Limited (a VPS provider), GTT Communications (some sort of bullshit-looking ISP4 who, i have been informed, is in fact a backbone operator), and M247 Europe SRL (a hosting provider) suddenly appeared. If memory serves me right, Datacamp, GTT and M247 were also companies i had flagged during my initial investigation in summer 2024, and added to the manually blocked/limited IPs alongside all of Huawei Cloud and Alibaba.
Interestingly, both Liasail and Zenlayer mention that they “Power AI” on their front page. They sure do. Worryingly, VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
# The Protection Measures
i have one goal, and one constraint. My goal is that i need to protect the forge as much as possible, by means of either blocking bots or offloading the cost to my VPS provider (whose electricity i do not pay for). My only constraint: i was not going to deploy a proof-of-work-based captcha system such as Anubis. There are two reasons for this constraint:
- i personally find that forcing your visitors to have to expend more computational power to prove they’re not a scraper is bad praxis. There are devices out there that legitimately want that access, but have limited computational power or features. And, yeah, there are multiple types of challenges, some of which take low-power devices into account or even those that cannot run JavaScript, but,
- Scrapers can easily bypass Anubis. It’s not a design flaw. Anubis is harm reduction, not pixie dust.
i tried layers of solutions:
- caching on the reverse proxy
- Iocaine 2 with no classifiers, which generates garbage in reply to any query you send it
- Manually redirecting IPs and rate-limiting them
- Deploying Iocaine 3, with its classifiers (Nam-Shub-of-Enki)
## Reverse-Proxy Caching
i have a confession to make: i never realized that nginx did not cache anything by default. That realization promptly came with the other realization that caching things correctly is hard. i may, some day, write about my experience of protecting a service that posted links to itself on the fediverse, so that it wouldn’t slow to a crawl for ten minutes after every post.
As for the rest of these, i will be showing my solution in nginx. You can,
almost certainly, figure out a way of doing exactly the same thing with any other
decent reverse proxy software.
To create a cache for my forge, i add the following line to /etc/nginx/nginx.conf:
proxy_cache_path /var/cache/nginx/forge/ levels=1:2 keys_zone=forgecache:100m;
That will create a 2-level cache called forgecache, with 100MB of in-memory index data,
for a cache
located at /var/cache/nginx/forge. i create the directory and make www-data
its owner and group.
In /etc/nginx/sites-enabled/vcinfo-git.conf, where my git forge’s site
configuration sits, i have a location block that serves the whole root of the
service, which i modify thusly:
location / {
proxy_cache forgecache;
proxy_buffering on;
proxy_cache_valid any 1h;
add_header X-Cached $upstream_cache_status;
expires 1h;
proxy_ignore_headers "Set-Cookie";
proxy_hide_header "Set-Cookie";
# more stuff...
}
That configuration does several things: it turns on caching and buffering at
the proxy
(proxy_buffering),
telling it to use forgecache
(proxy_cache)
and keep any page valid for an hour
(proxy_cache_valid).
It also adds an X-Cached response header that will let you debug whether or not a query hit or
missed the cache (add_header). The expires directive adds headers telling
your visitor’s browser that the content they cache will also expire in an hour
(expires).
Finally, the cache ignores any response header that sets a cookie
(proxy_ignore_headers,
proxy_hide_header),
to attempt to remove any page that could be customized for a user once they log
in.
EDIT 2026-03-19
As it turns out, proxy_ignore_headers on Set-Cookie is a catastrophic idea. Digging into
the documentation, you can find out after two or three indirections that, by default, caching is disabled if the Set-Cookie
header is present. By ignoring it, i essentially told nginx, “hey, please, cache my logged in pages!!”.
The correct way to do what i want here is to remove proxy_ignore_headers and proxy_hide_header (it’s useless on Set-Cookie by default - again, nginx will not
cache those pages). The result with that configuration is the same however: caching pages when bots only access each URL once is useless.
Also, you do not need 100m of cache index data. Holy fuck. That is way too much. 5 Megabytes is already enough.
The result? Caching was a disaster, predictably so. Caching works when the same resource is repeatedly queried, like with page assets, JavaScript, style sheets, etc. In this case, the thousands of actors querying my forge are coordinated, somehow, never (or rarely) query the same resource twice, and only download the raw HTML of the web pages.
Worse, caching messed up the display of authenticated pages. The snippets above
are not enough to delineate between an authenticated session and an
unauthenticated one, and it broke my forge so badly that i had to disable
caching and enable the next layer early on 2025-11-17, or i just could not
use my forge.
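One way to check ahead of time whether caching has any chance of helping is to measure how often the same path shows up more than once in an access log. The sketch below is only an upper bound on what a cache could do (it ignores expiry, query methods and authenticated pages). The name potential_cache_hits is made up, and it assumes the combined log format, where the request path is the seventh field.

```shell
# Hypothetical helper: reads access log lines on stdin and prints
# "<total requests> <unique paths> <best-case hit percentage>".
# Assumes the combined log format (request path in field 7).
potential_cache_hits() {
    LC_ALL=C awk '{ seen[$7]++; total++ }
        END {
            unique = 0
            for (path in seen) unique++
            if (total == 0) { print "0 0 0.00"; exit }
            # Only repeat requests for a path could ever be served from cache
            printf "%d %d %.2f\n", total, unique, 100 * (total - unique) / total
        }'
}
```

Usage would look like `zcat git-access-2025-11-17.log.gz | potential_cache_hits`. When crawlers rarely request the same URL twice, the last number stays close to zero, and so does the benefit of the cache.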
## Rate-Limiting on the Proxy
The next layer of protection simply consisted in enabling a global rate-limit on the most-hit repositories:
limit_req_zone wholeforge zone=wholeforge:10m rate=3r/s;
server {
# ...
# Regex location (~) matching any of the most-hit repositories
location ~ ^/(alopexlemoni/GenderDysphoria\.fyi|oror/necro|Lymkwi/linux|vc-archival/youtube-dl-original|reibooru/reibooru) {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_max_temp_file_size 2048m;
limit_req zone=wholeforge nodelay;
# No URI part here: nginx rejects one inside a regex location
proxy_pass http://<my actual upstream>;
}
}
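This was achieved in two directives. The first one, limit_req_zone, sits outside
the server {} block and defines a zone called wholeforge that stores 10MB of
state data and limits to 3 requests per second. The second one, limit_req,
applies that zone to the location block; since no burst is configured, any
request over the limit is rejected.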
When this was in place, however, actually accessing the Linux repository as a normal user (or any of the often-hit repositories) became a nightmare of waiting and request timeouts.
## Manually Redirecting to a Garbage Generator
Because caching was (predictably) useless, and rate-limiting was hindering me as well, i re-enabled the initial setup that was in place before my experiments: manually redirecting queries to a garbage generator (in this case, an old version of Iocaine). It’s largely based on my initial setup following this tutorial in French.
For the purpose of this part, you do not have to know what Iocaine does precisely. In the next section, i will present my current and final setup, with an updated Iocaine that also includes a classifier to decide which queries are bots and which are regular users. For now, i will present the version where i manually chose who to return garbage to based on IP addresses.
As a little bonus, it will also include rate-limiting of those garbage-hungry bots.
i add a file called /etc/nginx/snippets/block_bots.conf which contains:
if ($bot_user_agent) {
rewrite ^ /deflagration$request_uri;
}
if ($bot_ip) {
rewrite ^ /deflagration$request_uri;
}
location /deflagration {
limit_req zone=bots nodelay;
proxy_set_header Host $host;
proxy_pass <garbage upstream>;
}
This will force any query categorized as bot_user_agent or bot_ip to be
routed through to a different upstream which serves garbage. That upstream is
also protected by rate-limiting on a zone called bots which is defined in the
next bit of code. This snippet is actually meant to be included in your server {}
block using the include directive.
i then add the following in /etc/nginx/conf.d/bots.conf:
map $http_user_agent $bot_user_agent {
default 0;
# from https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
~*amazonbot 1;
~*anthropic-ai 1;
~*applebot 1;
~*applebot-extended 1;
~*brightbot 1;
~*bytespider 1;
~*ccbot 1;
~*chatgpt-user 1;
~*claude-web 1;
~*claudebot 1;
~*cohere-ai 1;
~*cohere-training-data-crawler 1;
~*crawlspace 1;
~*diffbot 1;
~*duckassistbot 1;
~*facebookbot 1;
~*friendlycrawler 1;
~*google-extended 1;
~*googleother 1;
~*googleother-image 1;
~*googleother-video 1;
~*gptbot 1;
~*iaskspider 1;
~*icc-crawler 1;
~*imagesiftbot 1;
~*img2dataset 1;
~*isscyberriskcrawler 1;
~*kangaroo 1;
~*meta-externalagent 1;
~*meta-externalfetcher 1;
~*oai-searchbot 1;
~*omgili 1;
~*omgilibot 1;
~*pangubot 1;
~*perplexitybot 1;
~*petalbot 1;
~*scrapy 1;
~*semrushbot-ocob 1;
~*semrushbot-swa 1;
~*sidetrade 1;
~*timpibot 1;
~*velenpublicwebcrawler 1;
~*webzio-extended 1;
~*youbot 1;
# Add whatever other pattern you want down here
}
geo $bot_ip {
default 0;
# Add your IP ranges here
}
# Rate-limiting setup for bots
limit_req_zone bots zone=bots:30m rate=1r/s;
# Return 429 (Too Many Requests) to slow them down
limit_req_status 429;
That bit of configuration does a mapping between the client IP and a variable
called bot_ip, and the client’s user agent and a variable called
bot_user_agent. When a known pattern listed in those blocks is found, the
corresponding variable is flipped to the provided value (here, 1). Otherwise,
it stays 0. Then, we define the rate-limiting zone that is used to slow down
the bots so they don’t feed on slop too fast. You will then need to install the
http-geoip2 nginx module (on Debian-based distributions, something like apt install libnginx-mod-http-geoip2 will do).
Once that is done, add the following line to the server block of every site
you want to protect:
include /etc/nginx/snippets/block_bots.conf;
And when you feel confident enough, roll a nginx -t and reload the unit for
nginx.
Now, if you’re using caddy or any other reverse proxy, there are probably
similar mechanisms available. You can go and peruse the documentation of Iocaine,
or look online for specific tutorials that, i am sure, other people have made
better than i would.
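Before flipping the switch, it can help to estimate from existing logs how many requests a handful of user-agent patterns would catch. The sketch below mimics the case-insensitive matching of the map block with plain awk. The name count_flagged_agents is made up, patterns are expected in lowercase, and it assumes the combined log format, where the user agent is the last quoted field.

```shell
# Hypothetical helper: counts log lines (on stdin) whose user agent matches
# any of the lowercase patterns given as arguments.
# Assumes the combined log format: splitting on double quotes puts the
# user agent in field 6.
count_flagged_agents() {
    # Join all arguments with '|' to build one alternation regex
    pattern=$(IFS='|'; printf '%s' "$*")
    awk -F'"' -v pat="$pattern" \
        'tolower($6) ~ pat { n++ } END { print n + 0 }'
}
```

Usage would look like `zcat git-access-2025-11-14.log.gz | count_flagged_agents gptbot claudebot bytespider`. It only tells you about bots that announce themselves, which, as the rest of this post shows, is far from all of them.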
Immediately after enabling it, and shoving all the IPs from Alibaba Cloud and Huawei Cloud in the bot config file, the activity slowed down on my server. Power usage went down to ~180W, CPU usage to roughly 60%, and it stopped making a hellish noise.
As the stats showed earlier, however, a lot of traffic was still hitting the server itself. Even weirder, there were still occasional spikes, every 3 hours, that lasted about an hour and a half, where the server would whirr and forgejo suffocate again.
Bots were still hitting my server, and there was no clear source for it.
## Automatically Classifying Bots and Poisoning Them: Iocaine and Nam-Shub-of-Enki
The steps i showed so far help when a single IP is hammering at your forge, or when someone is clearly scraping you from an Autonomous System that you do not mind blocking. Sadly, as i’ve shown above in Table 4, a surprising amount of scraping comes from broadband addresses. i can assemble lists of IPs as big as i want, or block entire ASNs, but i would love to have a per-query way of determining if a query looks legitimate.
The next steps of protection will rely on categorizing a source IP based on the credibility of its user agent. This mechanism is largely based on the documentation for Iocaine 3.x. We finally get to talk about Iocaine!
Iocaine is a tool that traps scrapers in a maze of meaningless pages that
endlessly lead to more meaningless pages. The content of these pages is
generated using a Markov chain, based on a corpus of texts given to the
software. Iocaine (specifically all versions after 3 at least5) is a middleware, in
the sense that it works by being placed on the line between your reverse proxy
and the service. Your reverse proxy will first begin by redirecting traffic to
Iocaine, and, if Iocaine deems a query legitimate, it will return a 421 Misdirected Request back at your reverse-proxy. The
latter must then catch it, and use the real upstream as a fallback. If
Iocaine’s Nam-Shub-of-Enki6 decides a query came from a bogus or otherwise undesirable source, it
will happily reply 200 OK and send generated garbage.
My setup lodges Iocaine 3 between nginx and my forge, following the Iocaine documentation to use the container version. i recommend you follow it, and then add the next little things to enable categorization statistics, and prevent the logging they’re based on from blowing up your storage:
- In
etc/config.d/03-nam-shub-of-enki.kdl, change the logging block to:
logging {
enable #true
classification {
enable #true
}
}
- In
docker-compose.yaml, add the following bits to limit classification logging to 50MB:
services:
iocaine:
# The things you already have here...
# ...
environment:
- RUST_LOG=iocaine=info
logging:
driver: "json-file"
options:
max-size: "50m"
My checks block in Nam-Shub-of-Enki is as such:
checks {
disable cgi-bin-trap
asn {
database-path "/data/ipinfo_lite.mmdb"
asns "45102" "136907"
}
ai-robots-txt {
path "/data/ai.robots.txt-robots.json"
}
generated-urls {
identifiers "deflagration"
}
big-tech {
enable #true
}
commercial_scrapers {
enable #true
}
}
i snatched a copy of the latest ipinfo ASN database for free and blocked AS45102 (Alibaba) and AS136907 (Huawei Clouds).
On 2025-11-18 at 00:00:29 UTC+1, i enabled Iocaine with the Nam-Shub-of-Enki classifier in front of my whole forge. Immediately, my server was no longer hammered. Power draw went down to just above 160W.
One problem i noticed however, while trying to deploy the artifact for this
blog post on my forge, is that Iocaine causes issues when huge PUT/PATCH/POST
requests with large bodies are piped through it: it will hang up before the
objects are entirely written. i am trying to figure out a way of only redirecting
HEAD and GET requests to Iocaine in nginx, like is done in the Caddy example
of the Iocaine documentation.
What i ended up settling on requires a bit of variable mapping. At the start of
your site configuration, before the server {} block:
map $request_method $upstream_location {
GET <iocaine upstream>;
HEAD <iocaine upstream>;
default <your actual upstream>;
}
map $request_method $upstream_log {
GET bot_access;
HEAD bot_access;
default access;
}
Then, in the block that does the default location, write:
location / {
proxy_cache off;
access_log /var/log/nginx/$upstream_log.log combined;
proxy_intercept_errors on;
error_page 421 = @fallback;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_pass http://$upstream_location;
}
That is, replace the upstream in proxy_pass with the upstream decided by the
variable mapping, and, while we’re at it, use $upstream_log to know which log
will be the final one for that request. i differentiate between bot_access.log
and access.log to gather my statistics, so the difference matters to me. Change
the variables to suit the way you do it (or remove it, if you don’t distinguish
clients in your log files).
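With requests split between two log files, the share of traffic classified as bots is one division away. A minimal sketch, where bot_share is a made-up name and both arguments are request counts you obtained yourself (for instance with wc -l on each log):

```shell
# Hypothetical helper: percentage of requests that ended up in the bot log.
# $1 = requests in the bot log, $2 = requests in the regular log
bot_share() {
    LC_ALL=C awk -v bots="$1" -v others="$2" 'BEGIN {
        total = bots + others
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * bots / total
    }'
}
```

As an example, assuming the "Common Logs" table only counts requests not flagged as bots, `bot_share 731593 3183` (the successful requests of 2025-11-19 from the two tables) comes out at roughly 99.6%.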
# Monitoring Iocaine
Currently, on 2025-11-30 at 16:33:00 UTC+1, Iocaine has served 38.16GB of garbage. Over the past hour, 152.11MB of such data was thrown back at undesirable visitors. 3.39GB over the past day, 22.22GB over the past week. You can get the snippet that describes my Iocaine-specific Grafana views here.
The vast majority of undesirable queries come from Claude, OpenAI, and Disguised Bots. Claude and OpenAI are absolutely gluttonous, and, once they have access to a ton of pages, they will greedily flock to fetch them like pigeons being fed breadcrumbs laced with strychnine.

AI bot scrapers (ai.robots.txt) maintain a constant 920~930 queries per minute
(15-ish QPS) over the 6 domains i have protected with Iocaine, including the
forge.
There is also a low hum of a mix of commercial scrapers (~1 request every two seconds), big tech crawlers (Facebook, Google, etc, about 2QPS or 110 queries/min), and, especially, fake browsers.
Classifying fake browsers is where Iocaine really shines, specifically thanks to the classifiers implemented via Nam-Shub-of-Enki. The faked bots classifier detects the likelihood that the user agent reported by the client is bullshit, generated from a list of technologies mashed together. For example, if your client reports a user agent for a set of software that never supported HTTP2, or never actually existed together, or is not even released yet, it will get flagged. Think, for example, Windows NT 4 running Chrome, pretending to be able to do TLS1.3.
The background-noise level of such queries is usually 140~160 queries per minute (or 2~3 QPS). However, notice those spikes in the graph above?
## The Salvos of Queries
For a while during my experiments i noticed those pillars of queries. My general nginx statistics would show a sharp increase of connections, with an initial ramp-up, and a stable-ish plateau lasting about an hour and a half, before suddenly stopping. It would then repeat, roughly three hours later.
Between October 29th and November 19th, and on November 28th, these spikes would constantly show up. As soon as i got Iocaine statistics running, it would flag all of those queries as faked browsers.
i investigated those spikes in particular, because they baffled and scared me: the regularity with which they probed me, and the sharpness of the ramp-up and halts, made me afraid that someone, somewhere, was organizing thousands of IPs to specifically take turns at probing websites. i have not reached any solid conclusions, beyond the following:
- The initial phase of an attack wave begins with a clear exponential ramp-up
- The ramp-up stops when the server starts throwing errors, or when the response latency reaches a given threshold
- Every wave of attack lasts roughly one hour and a half
- An individual IP will often contribute no more than one query, but some IPs reach 50 to 60 queries
- The same 15 or so ASNs keep showing up, with five regular leaders in IP count:
  - AS212238: Datacamp Limited
  - AS3257: GTT Communications
  - AS9009: M247 Europe SRL
  - AS203020: HostRoyale Technologies Pvt Ltd
  - AS210906: UAB “Bite Lietuva” (a Lithuanian ISP)
All of those are service providers. My working theory at the moment is that someone registered thousands of cheap servers with many different companies, and is selling access to them as web proxies for scraping and scanning. i will probably write something up later when i have properly investigated that specific phenomenon.
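For anyone who wants to check the "one query per IP" pattern on their own logs, here is a minimal sketch. It assumes the client IP is the first whitespace-separated field of each line, as in nginx's default `combined` log format; the function name `queries_per_ip_histogram` is made up for this example.

```shell
# Reads an access log on stdin.
# Assumption: the client IP is the first whitespace-separated field.
# Prints one line per bucket, "<queries per IP> <number of IPs>",
# sorted by queries per IP.
queries_per_ip_histogram() {
    awk '{ hits[$1]++ }
         END {
             # Group IPs by how many queries each one sent.
             for (ip in hits) bucket[hits[ip]]++
             for (n in bucket) print n, bucket[n]
         }' | sort -n
}
```

Feeding it the log lines of a single wave, for example `queries_per_ip_histogram < access.log` (the file name is a placeholder), gives a histogram where a tall first bucket means most IPs showed up exactly once.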
# Conclusion
Self-hosting anything that is deemed “content” openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell.
This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see.
i have to learn to protect myself from industrial actors in order to put anything online, because anything a person makes is valuable, and that value will be sucked dry by every tech giant to be emulsified, liquified, strained, and ultimately inexorably joined in an unholy mesh of learning weights.
This experience has rather profoundly radicalized the way i think about technology. Sanitized content can be chewed on and shat out by companies for training, but their AI tools will never swear. They will never use a slur. They will never have a revolutionary thought. Despite being an amalgamation of shit rolled up in the guts of the dying capitalist society, they are sanitized to hell and beyond.
The developer of Iocaine put it best when explaining why Iocaine has absolutely unhinged identifiers (such as SexDungeon, PipeBomb, etc): they will all trigger “safeguard” mechanisms in commercial AI tools. Absolutely no coding agent will accept analyzing and explaining code where the memory allocator’s free function is called liberate_palestine. i bet that if i described, in graphic detail, in the comments of this page, the different ways being a furry intersects with my sexuality, no commercial scraper would even dare ingest this page.
Fuck tech companies. Fuck “AI”. Fuck the corporate web.
- i could’ve put QoS in place to limit that traffic; but that would only have been a bandaid solution, and caused massive congestion because of traffic shaping anyways.
- Caching alone proved so inefficient that i had to enable the next layer before the day ended, because the server was on its knees and nothing visibly improved with just caching.
- “Whoops! i did not know that was still public. Yikes! That’s doxxing me!”, i said, while gathering those stats.
- Like, seriously. i spent five minutes on their home page trying to figure out if they were a tier-2 ISP, a datacenter infrastructure provider, an IT management company, or a hosting company, and i have no idea. Their home page is filled with impenetrable corporate jargon. If you want to know what they do, you have to look them up elsewhere.
- Initially, until 2025-11-15, i was using Iocaine 2.1, which required that you manually redirect traffic to it. The classifiers and request handlers that required Iocaine to be in the line of traffic were added in later versions.
- See, i don’t really understand to this day if Nam-Shub-of-Enki is a classifier, a request handler, or a set of classifying rules. To me, it’s a classifier framework, but i am not sure. Iocaine is great, but its documentation is a bit all over the place at the moment (and commits a couple of cardinal sins of documentation writing, like, for example, mixing documentation for developers with documentation for the users).