[–] buddascrayon@lemmy.world 1 points 3 hours ago

What if we just fed TimeCube into the AI models? Surely that would turn them inside out in no time flat.

[–] infinitesunrise@slrpnk.net 7 points 6 hours ago (2 children)

OK but why is there a vagina in a petri dish

[–] buddascrayon@lemmy.world 3 points 3 hours ago

I believe that's a close-up of the inside of a pitcher plant: a plant that sits there all day wafting out a sweet smell of food, waiting for insects to fall into its fluid-filled "belly", where they thrash around fruitlessly until they finally die and are dissolved, nourishing the plant they originally came to prey upon.

Fitting analogy, no?

[–] underline960@sh.itjust.works 11 points 6 hours ago

I was going to say something snarky and stupid, like "all traps are vagina-shaped," but then I thought about venus fly traps and bear traps and now I'm worried I've stumbled onto something I'm not supposed to know.

[–] antihumanitarian@lemmy.world 24 points 10 hours ago (1 children)

Some details: one of the major players running the tar-pit strategy is Cloudflare. They're a giant in networking and infrastructure, and they use AI (more traditional models, not LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.

Making nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they're a cheap way to get training data. If you make a non-zero portion of training data poisonous, scrapers have to spend ever more resources to filter it out. The better the nonsense, the harder it is to detect. Cloudflare is known to use small LLMs to generate the nonsense, hence requiring systems at least that complex to tell it apart.

So, in short, a tar pit serving garbage data actually decreases the average value of scraped data for bots that ignore do-not-scrape instructions.
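
For illustration, a toy Markov-chain babbler along the lines of what such tar pits serve (the seed file and parameters here are made up, and Cloudflare's actual LLM-based generator is not public; this is just the general idea):

```python
import random
from collections import defaultdict

def build_chain(corpus: str):
    """Map each observed word pair to the words that followed it."""
    words = corpus.split()
    chain = defaultdict(list)
    for a, b, nxt in zip(words, words[1:], words[2:]):
        chain[(a, b)].append(nxt)
    return chain

def babble(chain, length: int = 80) -> str:
    """Walk the chain to emit statistically plausible nonsense."""
    pair = random.choice(list(chain))
    out = list(pair)
    for _ in range(length):
        followers = chain.get(pair)
        if not followers:                     # dead end: re-seed the walk
            pair = random.choice(list(chain))
            followers = chain[pair]
        nxt = random.choice(followers)
        out.append(nxt)
        pair = (pair[1], nxt)
    return " ".join(out)

# Hypothetical usage: seed with any plain text, serve the output to crawlers.
chain = build_chain(open("seed_corpus.txt").read())
print(babble(chain))
```

Because the output follows real word statistics, a cheap filter can't tell it from genuine prose, which is exactly the economic point above.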

[–] fossilesque@mander.xyz 7 points 8 hours ago

The fact the internet runs on lava lamps makes me so happy.

[–] Novocirab@feddit.org 6 points 8 hours ago* (last edited 7 hours ago) (1 children)

There should be a federated system for blocking IP ranges that other server operators within a chain of trust have already identified as belonging to crawlers. A bit like fediseer.com, but possibly more decentralized.

(Here's another advantage of Markov-chain maze generators like Nepenthes: even when crawlers recognize that they've been served garbage and delete it, one has still obtained highly reliable evidence that the requesting IPs belong to crawlers.)

Also, whenever one is only partially confident that an IP range belongs to a crawler, instead of blocking it outright one can serve proof-of-work tasks (à la Anubis) with a complexity proportional to that confidence. This could also be useful for keeping crawlers somewhat in the dark about whether they've been put on a blacklist.
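
A sketch of that last idea, with confidence mapped to leading-zero-bits difficulty in a hash-based proof-of-work (the scaling is made up, and this is the generic scheme rather than Anubis's actual protocol):

```python
import hashlib
import os

def challenge_for(confidence: float) -> tuple[str, int]:
    """Issue a challenge whose difficulty grows with our confidence
    (0.0-1.0) that the requesting IP belongs to a crawler."""
    difficulty = round(confidence * 20)        # 0..20 leading zero bits
    return os.urandom(8).hex(), difficulty

def verify(seed: str, solution: str, difficulty: int) -> bool:
    """Accept if sha256(seed+solution) starts with `difficulty` zero bits."""
    digest = hashlib.sha256((seed + solution).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

def solve(seed: str, difficulty: int) -> str:
    """The client-side work; expected cost doubles with each extra bit."""
    i = 0
    while not verify(seed, str(i), difficulty):
        i += 1
    return str(i)

seed, bits = challenge_for(0.6)                # fairly suspicious -> 12 bits
assert verify(seed, solve(seed, bits), bits)
```

A legitimate browser solves a low-difficulty challenge imperceptibly, while a scraper hammering thousands of pages from a suspicious range pays an exponentially growing CPU bill.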

[–] Opisek@lemmy.world 3 points 8 hours ago (2 children)

You might want to take a look at CrowdSec if you don't already know it.

[–] Novocirab@feddit.org 3 points 7 hours ago* (last edited 7 hours ago) (1 children)

Thanks. Makes sense that things roughly along those lines already exist, of course. CrowdSec's pricing, which apparently starts at $900/month, seems forbiddingly expensive for most small-to-medium projects, though. Do you (or does anyone else) know of a similar solution for small or even nonexistent budgets? (Personally I'm not running any servers or projects right now, but I may in the future.)

[–] Opisek@lemmy.world 3 points 7 hours ago* (last edited 7 hours ago)

There are many continuously updated IP blacklists on GitHub. Personally, I have an automation that pulls from 10+ such lists and blocks every IP that appears on, like, 3 or more of them. I'm not sure there are any blacklists specific to "AI", but as far as I know, most of them already included the particularly annoying scrapers before the whole GPT craze.
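
The aggregation part of such an automation fits in a short script. A minimal sketch (the list URLs are placeholders, not the actual sources; pipe the output into nftables, iptables, or fail2ban):

```python
import urllib.request
from collections import Counter

# Placeholder URLs -- substitute blocklists you actually trust.
LISTS = [
    "https://example.org/scrapers-a.txt",
    "https://example.org/scrapers-b.txt",
    "https://example.org/scrapers-c.txt",
]
THRESHOLD = 3  # block an IP only if this many lists agree on it

def fetch_ips(url: str) -> set[str]:
    """Expect one IP (or CIDR range) per line; '#' starts a comment."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        lines = resp.read().decode(errors="replace").splitlines()
    return {ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")}

counts = Counter()
for url in LISTS:
    counts.update(fetch_ips(url))

for ip in sorted(ip for ip, n in counts.items() if n >= THRESHOLD):
    print(ip)  # feed to your firewall, e.g. one rule per entry
```

Requiring agreement between several independent lists keeps one poorly maintained source from blocking legitimate users.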

[–] rekabis@lemmy.ca 1 points 7 hours ago* (last edited 7 hours ago) (1 children)

Holy shit, those prices. Like, I wouldn’t be able to afford any package at even 10% the going rate.

Anything available for the lone operator running a handful of Internet-addressable servers behind a single symmetrical SOHO connection? As in, anything for the other 95% of us that don’t have literal mountains of cash to burn?

[–] Opisek@lemmy.world 1 points 7 hours ago* (last edited 7 hours ago)

They do seem to have a free tier of sorts. I don't use them personally; I only know of their existence and have been meaning to give them a try. Seeing the pricing just now, though, I might not even bother, unless the free tier is worth anything.

[–] mlg@lemmy.world 9 points 9 hours ago

--recurse-depth=3 --max-hits=256

[–] stm@lemmy.dbzer0.com 32 points 14 hours ago

Such a stupid title, great software!

[–] MonkderVierte@lemmy.ml 21 points 16 hours ago (2 children)

Btw, how about limiting clicks per second/minute to counter distributed scraping? A user who clicks more than 3 links per second is not a person, and neither is one who does 50 in a minute. And if they're then blocked and switch to the next IP, the bandwidth they can occupy is still limited.
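
That suggestion amounts to a per-IP sliding-window limit. A minimal in-memory sketch with the thresholds from the comment (a real deployment would use the reverse proxy's or CDN's built-in rate limiting):

```python
import time
from collections import defaultdict, deque

LIMITS = [(1.0, 3), (60.0, 50)]  # (window seconds, max clicks): 3/s, 50/min
hits = defaultdict(deque)        # ip -> timestamps of recent requests

def allow(ip: str) -> bool:
    """Record one request and report whether the IP is within both limits."""
    now = time.monotonic()
    q = hits[ip]
    q.append(now)
    longest = max(w for w, _ in LIMITS)
    while q and now - q[0] > longest:   # forget clicks older than any window
        q.popleft()
    return all(
        sum(1 for t in q if now - t <= window) <= limit
        for window, limit in LIMITS
    )
```

As the replies note, this only helps against clients that reuse an IP; distributed scrapers that make one request per address sail right under it.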

[–] letsgo@lemm.ee 9 points 15 hours ago (1 children)

I click links frequently and I'm not a web crawler. Example: get search results, open several likely looking possibilities (only takes a few seconds), then look through each one for a reasonable understanding of the subject that isn't limited to one person's bias and/or mistakes. It's not just search results; I do this on Lemmy too, and when I'm shopping.

[–] MonkderVierte@lemmy.ml 8 points 15 hours ago (1 children)

Ok, same, make it 5 or 10. Since I use Tree Style Tabs and Auto Tab Discard, I do get a temporary block in some webshops if I load (not just open) too many tabs in too short a time. Probably a CDN thing.

[–] Opisek@lemmy.world 1 points 7 hours ago

Would you mind explaining your workflow with these tree style tabs? I am having a hard time picturing how they are used in practice and what benefits they bring.

[–] JadedBlueEyes@programming.dev 8 points 16 hours ago (6 children)

They make one request per IP, so rate-limiting per IP does nothing.

[–] Iambus@lemmy.world 14 points 16 hours ago

Typical bluesky post

[–] Zacryon@feddit.org 58 points 21 hours ago (5 children)

I suppose this will become an arms race, just like with ad-blockers and ad-blocker detection/circumvention measures.
Scrapers will find ways around blockers and traps; then those become more sophisticated; then the scrapers get better again, and so on.

I don't really see an end to this madness. Such a huge waste of resources.

[–] arararagi@ani.social 8 points 15 hours ago

Well, the ad blockers are still winning, even on Twitch, where the ads come from the same pipeline as the stream; people built solutions that still block them, since uBlock Origin couldn't do it by itself.

[–] enbiousenvy@lemmy.blahaj.zone 12 points 17 hours ago

The rise of LLM companies scraping the internet is also, I've noticed, the moment YouTube started cracking down harder on ad blockers and third-party viewers.

The Piped and Invidious instances I used to use no longer work, and neither do many other instances. NewPipe has been breaking more frequently. youtube-dl and yt-dlp sometimes can't fetch higher-resolution video. And sometimes the main YouTube site is broken on Firefox with uBlock Origin.

Not just YouTube: Z-Library, and especially Sci-Hub and LibGen, have also been harder to use at times.
