tech-kern archive
Re: cvsweb anti-bot protection (was: Retrieving MAC address from struct ifnet)
On Thu, Jul 03, 2025 at 11:30:48AM +0200, Jörg Sonnenberger wrote:
> On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > Can you really blame kids for looking at all 5000 links from a single
> > file, when you give them 5000 links to start with? Maybe start by not
> > giving the 5000 unique links from a single file, and implement caching
> > / throttling? How could you know there's nothing interesting in there
> > if you don't visit it all for a few files first?
>
> Are you intentionally misrepresenting the problem?
>
> > These AIs literally behave the exact same way as humans; they're
> > simply dumber and more persistent. The way CVSweb is designed, it's
> > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > probably like 20 years ago?
>
> This is complete BS. "wget -r" uses a single connection (at any point in
> time). It uses a consistent source address. It actually honors robots.txt by
> default. None of that applies to the current generation of AI scrapers:
>
> (1) They have no effective rate limiting mechanism on the origin side.
> (2) They are intentionally distributing requests to avoid server side rate
> limits.
> (3) The combination of the two makes most caching useless.
> (4) They (intentionally or maliciously) do not honor robots.txt.
> (5) They are intentionally faking the user agent.
I have heard these claims a few times, but I don't think I have seen any
in-depth analysis of them. Do you happen to have a link to a more
detailed analysis?

Personally, I am seeing GPTBot crawling at a rate of up to about 1
request per second. On the other hand, I have seen Scrapy-based
crawlers hitting my web sites at full speed over multiple concurrent
connections, but I am not sure whether these are connected to the AI
scrapers.
Christof
--
https://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org