[–] Moonrise2473@feddit.it 28 points 1 month ago* (last edited 1 month ago) (4 children)

A search engine shouldn't have to pay a website for the honor of bringing it visits and ad views.

Fuck reddit, get delisted, no problem.

Weird that Google is ignoring their robots.txt, though.

Even if they pay them to be able to say that glue is perfect on pizza, having

User-agent: *
Disallow: /

should block Googlebot too. That means Google programmed an exception into Googlebot to ignore robots.txt on that domain, and that shouldn't be done. What's the purpose of the file then?

Because robots.txt is completely honor-based (a crawler doesn't even need to pretend to be another bot; it could just ignore the file), it should be

User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
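
To illustrate how honor-based it is: compliance happens entirely on the crawler's side, and nothing enforces it. A minimal sketch using Python's standard-library robotparser (example.com is a placeholder domain):

# A well-behaved crawler voluntarily checks robots.txt before fetching.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# With "User-agent: *" / "Disallow: /" this returns False for every path;
# a rude crawler simply never bothers to run this check.
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))
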
[–] MrSoup@lemmy.zip 28 points 1 month ago (2 children)

I doubt Google respects any robots.txt

[–] DaGeek247@fedia.io 27 points 1 month ago (3 children)

My robots.txt has been respected by every bot that visited it in the past three months. I know this because I wrote a page that IP-bans anything that visits it, and I also listed it as a disallowed path in the robots.txt file.

I've only gotten like 20 visits in the past three months though, so it's a very small sample size.

[–] mozz@mbin.grits.dev 13 points 1 month ago (1 children)

I know this because I wrote a page that IP-bans anything that visits it, and I also listed it as a disallowed path in the robots.txt file.

This is fuckin GENIUS

[–] Moonrise2473@feddit.it 8 points 1 month ago (2 children)

Only if you don't want any visits except your own, because that removes your site from every search engine.

You should instead write a "Disallow: /juicy-content" rule and then ban anything that tries to access that page (only bad bots would follow that path), as in the sketch below.
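
A minimal sketch of that honeypot idea, assuming a Flask app; /juicy-content is the trap path from above, and the "ban" is just an in-memory set standing in for a real firewall rule:

from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # stand-in for a real deny list / firewall integration

@app.before_request
def reject_banned_clients():
    # Refuse every further request from an IP that already fell into the trap.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/robots.txt")
def robots():
    # Honest crawlers read this and never touch the trap path.
    return ("User-agent: *\nDisallow: /juicy-content\n",
            200, {"Content-Type": "text/plain"})

@app.route("/juicy-content")
def trap():
    # Only a bot that ignores robots.txt ends up here; ban its IP.
    banned_ips.add(request.remote_addr)
    abort(403)
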

[–] Miaou@jlai.lu 23 points 1 month ago (1 children)

That's exactly what was described...?

[–] Moonrise2473@feddit.it 3 points 1 month ago (1 children)

Oops. As a non-native English speaker, I misunderstood what he meant. I wrongly understood that he had set the server to ban everything that asked for robots.txt.

[–] Zoop@beehaw.org 2 points 1 month ago

Just in case it makes you feel any better: I'm a native English speaker who always aced the reading comprehension tests back in school, and I read it the exact same way. Lol! I'm glad I wasn't the only one. :)

[–] mozz@mbin.grits.dev 5 points 1 month ago

You need to re-read what was described, more carefully. Imagine, for example, that by "a page" the person means a page called /juicy-content or something.

[–] MrSoup@lemmy.zip 2 points 1 month ago

Thank you for sharing

[–] thingsiplay@beehaw.org 2 points 1 month ago* (last edited 1 month ago)

Interesting way of testing this. Another would be to query the search engines with site:your.domain added (Edit: typo corrected. Of course without a - in front of site:, otherwise you exclude your site instead of limiting results to it) to show results from your site only. Not an exhaustive check, but another tool to test this behavior.
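
For example, with the placeholder domain from above:

site:your.domain robots     <- only results from your.domain
-site:your.domain robots    <- every result except your.domain

If the first query returns nothing even for pages you know exist, the engine hasn't indexed them (or is choosing not to show them).
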

[–] Moonrise2473@feddit.it 10 points 1 month ago

For ordinary sites they respect it; they even warn a webmaster who submits a sitemap containing paths that are blocked in robots.txt.

[–] skullgiver@popplesburger.hilciferous.nl 15 points 1 month ago (1 children)

I think Reddit serves Googlebot a different robots.txt to prevent issues. For instance, check Google's cached version of Reddit's robots.txt: it only blocks the stuff you'd expect to be blocked.
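
Purely as a hypothetical sketch of how a site could do that (not Reddit's actual code), again with Flask; note that matching on the User-Agent header alone is trivially spoofable, which is why Google also publishes Googlebot's IP ranges for verification:

from flask import Flask, request

app = Flask(__name__)

# What everyone else gets: the whole site blocked.
PUBLIC_ROBOTS = "User-agent: *\nDisallow: /\n"
# What a client claiming to be Googlebot gets: only the expected paths blocked.
GOOGLEBOT_ROBOTS = "User-agent: *\nDisallow: /search\n"

@app.route("/robots.txt")
def robots():
    ua = request.headers.get("User-Agent", "")
    body = GOOGLEBOT_ROBOTS if "Googlebot" in ua else PUBLIC_ROBOTS
    return body, 200, {"Content-Type": "text/plain"}
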

[–] Zoop@beehaw.org 2 points 1 month ago

User-Agent: bender

Disallow: /my_shiny_metal_ass

Ha!

[–] tal@lemmy.today 4 points 1 month ago* (last edited 1 month ago)

I guessed in a previous comment that, given their new partnership, Reddit is probably feeding its comment database to Google directly, which reduces load for both of them and lets Google get real-time updates of the whole kit and caboodle rather than polling individual pages. Both Google and Reddit are better off doing that, and for Google it would make sense to special-case any site that is large and valuable enough to warrant the effort.

I know that Reddit built functionality for that before; it was used for pushshift.io and, I believe, for bots.

I doubt that Google is actually using Googlebot on Reddit at all today.

I would bet against either Google violating robots.txt or Reddit serving different robots.txt files to different clients (why bother? It's just unnecessary complication).

[–] jarfil@beehaw.org 3 points 1 month ago

Google is paying for the use of Reddit's API, not for scraping the site.

That's Reddit's new business model: if you want "their" (users') content, you pay for API access.