ptz

joined 2 years ago
[–] ptz@dubvee.org 2 points 1 hour ago

Maybe I should flesh it out into an actual guide. The Nepenthes docs are "meh" at best and completely gloss over integrating it into your stack.

You'll also need to give it corpus text to generate slop from. I used transcripts from 4 or 5 weird episodes of Voyager (let's be honest: shit got weird on Voyager lol), mixed with some Jack Handey quotes and a few transcripts of Married... with Children episodes.

[–] ptz@dubvee.org 13 points 3 hours ago* (last edited 3 hours ago) (2 children)

Thanks!

Mostly there are three steps involved:

  1. Set up Nepenthes to receive the traffic
  2. Perform bot detection on inbound requests (I use a regex list; one is provided below)
  3. Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.

Here's a rough guide I commented a while back: https://dubvee.org/comment/5198738

Here's the post link at lemmy.world which should have that comment visible: https://lemmy.world/post/40374746

You'll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn't work, here's the text of it:

So, I set this up recently and agree with all of your points about the actual integration being glossed over.

I already had bot detection set up in my Nginx config, so adding Nepenthes just meant changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests, but now it redirects them to Nepenthes.

Rather than trying to do rewrites and pretend the Nepenthes content is under my app's URL namespace, I just do a redirect, which the bot crawlers tend to follow just fine.

There are several parts to this, each in its own include file, to keep my config sane.

  • An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn't do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.

  • An include file that performs the action if the variable is set to 1. This has to be included in the server portion of each virtual host where I want the bot traffic to go to Nepenthes. If it isn't included in a virtual host's server block, then bot traffic is allowed.

  • A virtual host where the Nepenthes content is presented. I run a subdomain (content.mydomain.xyz). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).

The map-bot-user-agents.conf is included in the http section of Nginx and applies to all virtual hosts. You can either include this in the main nginx.conf or at the top (above the server section) in your individual virtual host config file(s).

The deny-disallowed.conf is included individually in each virtual host's server section. Even though the bot detection is global, nothing is done unless the virtual host's server section includes the action file.
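Schematically, that wiring looks like this (just a sketch - the paths and names are placeholders for however your config is laid out):

# nginx.conf - sketch only
http {
    # Global: classifies every request's user agent, sets $ua_disallowed to 0 or 1
    include /etc/nginx/conf.d/map-bot-user-agents.conf;

    server {
        server_name myservice.mydomain.xyz;

        # Opt-in: detected bots on this vhost get sent to Nepenthes
        include /etc/nginx/snippets/deny-disallowed.conf;

        location / {
            proxy_pass http://my_backend;  # your actual service
        }
    }
}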

Files

map-bot-user-agents.conf

Note that I'm treating Google's crawler the same as an AI bot because, well, it is. They're abusing their search position by double-dipping on the crawler, so you can't opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I've also commented out the Python requests user agent. And forgive the mess at the bottom of the file: I inherited the seed list of user agents and haven't cleaned up that massive regex one-liner.

# Map bot user agents
## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1

map $http_user_agent $ua_disallowed {
    default 		0;
    "~PerplexityBot"	1;
    "~PetalBot"		1;
    "~applebot"		1;
    "~compatible; zot"	1;
    "~Meta"		1;
    "~SurdotlyBot"	1;
    "~zgrab"		1;
    "~OAI-SearchBot"	1;
    "~Protopage"	1;
    "~Google-Test"	1;
    "~BacklinksExtendedBot" 1;
    "~microsoft-for-startups" 1;
    "~CCBot"		1;
    "~ClaudeBot"	1;
    "~VelenPublicWebCrawler"	1;
    "~WellKnownBot"	1;
    #"~python-requests"	1;
    "~bitdiscovery"	1;
    "~bingbot"		1;
    "~SemrushBot" 	1;
    "~Bytespider" 	1;
    "~AhrefsBot" 	1;
    "~AwarioBot"	1;
#    "~Poduptime" 	1;
    "~GPTBot" 		1;
    "~DotBot"	 	1;
    "~ImagesiftBot"	1;
    "~Amazonbot"	1;
    "~GuzzleHttp" 	1;
    "~DataForSeoBot" 	1;
    "~StractBot"	1;
    "~Googlebot"	1;
    "~Barkrowler"	1;
    "~SeznamBot"	1;
    "~FriendlyCrawler"	1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|
^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfl
y|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTT
rack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|
^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErs
PRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^Qu
eryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanne
r|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^Void
EYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^
WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;

}

deny-disallowed.conf

# Deny disallowed user agents
if ($ua_disallowed) {
    # This redirects them to the Nepenthes domain. So far, pretty much all the bot
    # crawlers have been happy to accept the redirect and crawl the tarpit continuously.
    return 301 https://content.mydomain.xyz/;
}
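The Nepenthes vhost itself isn't shown since it's a bog-standard reverse proxy. Here's a rough sketch of what mine looks like - assuming Nepenthes listens locally on port 8893 (that's an assumption; use whatever port your daemon is actually configured for) and with placeholder cert paths:

# content.mydomain.xyz vhost - sketch, not my literal config
server {
    listen 443 ssl;
    server_name content.mydomain.xyz;

    # Placeholder cert paths; adjust to your setup
    ssl_certificate     /etc/letsencrypt/live/content.mydomain.xyz/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/content.mydomain.xyz/privkey.pem;

    location / {
        # Everything on this vhost goes to the Nepenthes tarpit
        proxy_pass http://127.0.0.1:8893;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

You can sanity-check the whole chain with something like curl -I -A "GPTBot" https://myservice.mydomain.xyz/ and confirm you get the 301 to the tarpit, while a normal user agent still reaches your backend.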

[–] ptz@dubvee.org 61 points 4 hours ago (10 children)

I was blocking them but decided to shunt their traffic to Nepenthes instead. There's usually 3-4 different bots thrashing around in there at any given time.

If you have the resources, I highly recommend it.

[–] ptz@dubvee.org 6 points 5 hours ago* (last edited 5 hours ago)

Most of the requirements are going to be for the database, and that depends on:

  1. How many active users you expect
  2. How many large rooms you or your users join

I left many of the large Matrix spaces I was in, and mine is now mostly just 1:1 chats or a group chat with a handful of friends. Given that low-usage case, I can run my server on a Pi 4 with 4 GB of RAM quite comfortably. I don't do that in practice, but I do have that set up as a backup server - it periodically syncs the database from my main server - and it works fine. The bottleneck there, really, is the SD card storage, since I didn't want an external SSD hanging off of it.

Even when I was active in several large Matrix spaces/rooms, a USFF OptiPlex with a quad-core i5, 8 GB of RAM, and a 500 GB SSD was more than enough to run it comfortably alongside some other services like LibreTranslate.

[–] ptz@dubvee.org 7 points 7 hours ago (1 children)

I disabled local thumbnail generation almost a year ago, and things mostly work the same.

Instead of a local thumbnail image URL for things like news articles that get posted, the thumbnail will be the direct URL from the source's og:image metadata. Usually those load fine, but sometimes they don't due to CORS settings on their side. Probably only 1-2% of posts have issues, though.

For image posts that come in via federation (memes, pics, etc.), the thumbnail image URL is the same as the post URL. In other words, you're loading the full-res version in the feed. Since I use a web client that has "card view", this actually works out better, visually. YMMV on whether that's a drawback for you.

The only pitfall is that you will lose thumbnails for image posts if an instance goes offline or shuts down.

I'm sure that does increase load slightly on other instances, but no more than if the remote instance had image proxying turned on. And the full-res version always has to load from the remote instance (even if you have local thumbnail generation enabled). All in all, I'd say the additional load is acceptable given the benefits of disabling local thumbnail generation.

To mitigate that, in my case anyway, I have my own image proxy/cache in place. My default UI is Tesseract, and it's configured with the image proxy/cache on by default (I think I saw that Photon is also working on something similar). In this configuration, the first person to scroll past a remote image fetches it directly (via the proxy/cache), and it's then available locally in the cache for everyone else (unless they're connecting with a different client that doesn't use Tesseract's proxy). Granted, I shut down my instance last year, and it's now just a private testbed for development, but when I did have daily active users (plural), the proxy cache helped.

Now the only images my instance stores are ones that are uploaded locally.
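If you want to try it yourself, the toggle lives in Lemmy's config.hjson under the pictrs block - at least it did on the 0.19.x line. The key name below is from memory, so treat it as an assumption and double-check your version's defaults:

pictrs: {
    url: "http://pictrs:8080/"
    # Don't generate/store local thumbnails for external links;
    # posts fall back to the remote og:image URL instead.
    # (key name from memory - verify against your Lemmy version)
    cache_external_link_previews: false
}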

Why did I disable local thumbnails?

  • I closed up my instance and didn't want potentially problematic thumbnails being generated while I wasn't actively modding it
  • Generated thumbnails go in, but they don't go out. There's no way to clean them up later, except for the ephemerally generated ones (if someone requests a version in a custom size, for example)
  • Increasing storage costs. Like, I'd be scrolling the feed and see some of the dumbest shitposts while constantly thinking "Ugh, this is costing me money to store a copy".
[–] ptz@dubvee.org 102 points 23 hours ago (4 children)

Other Republican politicians, including former President Donald Trump, also criticized the show as inappropriate.

  1. If only
  2. USA Today needs to either use a better model or just get rid of the AI-generated key point summary.
[–] ptz@dubvee.org 4 points 1 day ago

Orphan Black: Live

[–] ptz@dubvee.org 9 points 1 day ago* (last edited 1 day ago)

Somewhere between 35 and 39, but yeah. Not sure how old she was when we got her (fully grown), but I was 5 or 6 then and was 40 when she passed. Have to assume it was just old age. Always called her "Horse, of Course" lol

[–] ptz@dubvee.org 23 points 1 day ago (11 children)

Sorry to hear. How old was he? My family had a horse since I was like 5 or 6. She hated being ridden but would follow you around like a dog. She died year-before-last at, I believe, age 39.

 

An American superhero fan short film based on the Power Rangers franchise; unlike the kid-friendly franchise, the short depicts an adult-oriented take on the source material.

It was directed by Joseph Kahn, who co-wrote with James Van Der Beek and Dutch Southern, and produced by Adi Shankar and Jil Hardin. The short film featured an ensemble cast starring Katee Sackhoff, Van Der Beek, Russ Bain, Will Yun Lee, and Gichi Gamba. It was released on YouTube and Vimeo on February 23, 2015.

[–] ptz@dubvee.org 47 points 2 days ago* (last edited 2 days ago) (5 children)

I prefer sans-serif fonts visually but prefer serif for readability, so I use Atkinson Hyperlegible, which is a mish-mash of both.

And bonus meme:

[–] ptz@dubvee.org 1 points 2 days ago (1 children)

FYI: I moved the allow rule for DNS to the top of the chain, so that should fix problems with DNS providers not being able to reach the authoritative name servers.

[–] ptz@dubvee.org 3 points 3 days ago* (last edited 3 days ago)

Ugh. Thanks. It's quite possible, though maybe just a regional one? I did inadvertently block one of the IPs Let's Encrypt uses for secondary validation, so this may be another case of that.

I get a shitload of bad traffic from Southeast Asia (mostly Philippines/Singapore AWS) and have taken to blanket-blocking whole routes rather than constantly playing whack-a-mole. Fail2ban only goes so far for case-by-case blocking.
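The blanket blocks are nothing fancy - roughly this, with ipset (the CIDR below is a placeholder; substitute the actual ranges you're seeing):

# Blanket-block abusive routes with ipset + iptables (sketch)
ipset create blocked-routes hash:net
ipset add blocked-routes 203.0.113.0/24    # placeholder; use the real offending CIDRs
iptables -A INPUT -m set --match-set blocked-routes src -j DROP

# Allow rules (like the DNS one from the other comment) have to be inserted
# above the drop so the authoritative name servers stay reachable:
iptables -I INPUT 1 -p udp --dport 53 -j ACCEPT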

Here's the image from the meme from an alternate source:


Edit: Okay, so I tried to reproduce this while I was putting together a bug report, and it's no longer happening when I follow my "steps to reproduce". I re-added .ml to my domain filter list and can still resolve posts to c/Books, which has a link to .ml in its description. So maybe it was just a glitch? I'm gonna play around with it and see before submitting it as a bug. But I do know that before I removed .ml from that list yesterday, it refused to resolve those posts because of the link in the description (none of the test posts hit on that domain) and consistently said Domain is blocked. :sigh:

Edit 2: Ok, now it's behaving as described. There must be some lag/delay between adding a domain to the filter list and it applying to inbound federation. Submitting a bug.

Edit 3: Bug 6320


I was one of the trailblazers who defederated from .ml, and once the domain filtering feature was added, I added lemmy [dot] ml to my domain blocks in the admin panel. Reason being, I don't want .ml content, including crossposts and re-posted images.

I thought that was working great until I noticed today that I hadn't gotten any posts to !books@lemmy.world for several months. Even trying to manually resolve a post pulled directly from there wouldn't load it. When I finally checked the server logs, there was a Domain is blocked event right after the logged call to ResolveObject. Of course, the logs didn't say what domain.

Long story short, after scouring the randomly-selected test post to see if there was some kind of false positive, I finally realized there's a "Related Community" link to a community on .ml in c/Books's community description, and that was what it was hitting on. Any post coming in to c/Books was being rejected because the community description linked to something in my site's URL filters.

 

My backyard is on a hill, and the neighbor's kids decided to sled all the way down it, slam into the top of my retaining wall, and knock a bunch of shit off (breaking two big terra-cotta planters in the process), all while screeching like banshees. Needless to say, I'm not super happy with them or their parents.

So I hear the commotion, step out the back door, and literally and instinctively yell the thing.

Spring project #185: Install a fence, possibly electrified and razor-wired.

6
submitted 1 week ago* (last edited 1 week ago) by ptz@dubvee.org to c/videos@lemmy.world
 

The film starts by reviewing the concept and the early days of phreaking, featuring anecdotes of phreaking experiences (often involving the use of a blue box) recounted by John Draper and Denny Teresi. By way of commentary from Steve Wozniak, the film progresses from phreaking to computer hobbyist hacking (including anecdotal experiences of the Homebrew Computer Club) and on to computer security hacking, noting differences between these two forms of hacking in the process.

The featured computer security hacking and social engineering stories and anecdotes predominantly concern the experiences of Kevin Mitnick. The film also deals with how society's (and notably law enforcement's) fear of hacking has increased over time due to media coverage of hacking (by way of the film WarGames as well as journalistic reporting on actual hackers), combined with society's growing adoption of, and reliance on, computing and communication networks.

563
submitted 3 weeks ago* (last edited 3 weeks ago) by ptz@dubvee.org to c/linuxmemes@lemmy.world
 

About the only time I find myself using regular Wikipedia these days is if I need to know whether someone has died since August 2025, when this ZIM dump was created.

 

With launch potentially just three weeks away, the agency is working tirelessly to get the SLS rocket, Orion spacecraft, and the Artemis 2 crew ready for liftoff.

It’s official: NASA plans to roll its Space Launch System (SLS) rocket and Orion spacecraft out to the launch pad on Saturday. The move will signal the final stage of preparation for the Artemis 2 mission, which will send astronauts beyond Earth’s orbit and around the Moon for the first time since the Apollo era.

In a Friday update, NASA said it could take up to 12 hours for the SLS to complete the 4-mile (6.4-kilometer) journey from the Vehicle Assembly Building (VAB) at Kennedy Space Center to Launch Pad 39B. Teams are working 24/7 to complete the necessary tasks ahead of rollout, but it could be pushed back if they need more time for technical preparations or if the weather interferes.

“We have important steps remaining on our path to launch and crew safety will remain our top priority at every turn, as we near humanity’s return to the Moon,” Lori Glaze, acting associate administrator for NASA’s Exploration Systems Development Mission Directorate, said in the statement.

If the agency can complete these steps without any major complications, Artemis 2 could launch as soon as February 6.

 

PIC S3E05: Imposters

 
 

A little late, but didn't think of it until I put on this episode.

130
I felt that. (tesseract.dubvee.org)
 

PIC S1E07: Nepenthe
