• squidspinachfootball
    4 months ago

    IIRC, isn’t robots.txt more of a gentlemen’s agreement? I vaguely recall that bots can crawl a site regardless; it’s just that most devs respect robots.txt and don’t crawl what it disallows. Could be wrong though, happy to be corrected.

    • tal@lemmy.today
      4 months ago

      Sure, you can write software that violates the spec. But I mean, that’d be true for anything that Reddit can do on their end. Even if they block responses to what they think are bots, software can always try hard to impersonate users and scrape websites. You could go through a VPN, or pretend to be a browser following a link to a page.

      But major search engines will follow the spec with their crawlers.

      EDIT: RFC 9309, Robots Exclusion Protocol

      https://datatracker.ietf.org/doc/html/rfc9309

      > If no matching group exists, crawlers MUST obey the group with a user-agent line with the “*” value, if present.

      > To evaluate if access to a URI is allowed, a crawler MUST match the paths in “allow” and “disallow” rules against the URI.
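
      For illustration, here is a minimal sketch of that check using Python’s standard-library urllib.robotparser. The robots.txt content, the bot names, and the example.com URL are all made up for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other crawler falls back to the "*" group and is disallowed.
ROBOTS_TXT = """\
User-agent: examplebot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/r/some/thread"  # made-up URL

# A crawler with its own matching group uses that group's rules.
named = parser.can_fetch("ExampleBot", url)
# A crawler with no matching group falls back to the "*" group.
other = parser.can_fetch("SomeOtherBot", url)

print(named, other)
```

      A well-behaved crawler would run a check like this before every request and skip any URL for which it returns False. Note that this module predates RFC 9309 and checks rules in file order rather than by longest match, so it can disagree with the RFC on files with overlapping rules.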

      EDIT2: Although, amusingly, Google apparently isn’t following it with GoogleBot in this particular case, given the way that they’re signing agreements. They’ll still honor it for sites that they haven’t signed agreements with.

      EDIT3: Actually, on second thought, GoogleBot may be honoring it too. GoogleBot may not be crawling Reddit anymore. They may have some “direct pipe” that passes comments to Google that bypasses Google’s scraper. Less load on both their systems, and lets Google get real-time index updates without having to hammer the hell out of Reddit’s backend to see if things have changed. Like, think of how Twitter’s search engine is especially useful because it has full-text search through comments and immediately updates the index when someone comments.

      • squidspinachfootball
        4 months ago

        That’s a good point, it’s probably way less load and overhead if Reddit and Google just send info back and forth instead of scraping. Good way for Google to keep their spot as the favoured search engine and beat the competition too, since everything that comes up these days is articles full of SEO nonsense at best and AI-generated nonsense at worst. If nobody else can read the actual human responses, Google has a huge leg up. Also interesting to see that Google’s honouring the txt file even when nobody’s holding them to it.

        I had no idea Twitter’s search updated their index immediately after a comment is posted though. That’s a lot of updates considering the amount of posts they get daily.

        • tal@lemmy.today
          4 months ago

          > I had no idea Twitter’s search updated their index immediately after a comment is posted though.

          While I never had a Twitter account, it’s the major reason that I used the service anonymously. In an unfolding event, like a natural disaster or something, it was absolutely unparalleled in its ability to rapidly comb through enormous amounts of information being plonked in by people around the world. I strongly prefer Reddit-style forum structure most of the time, but for issues for which there are no pre-existing communities and where the common issue is one that will only exist for a short period of time, I think that Twitter’s ad-hoc connections between retweets and hashtags work much better than Reddit’s association-of-comments-by-subreddit.

          I understand that Mastodon, unfortunately, doesn’t have a full-text search feature, just searching based on exact hashtags. Actually…hmm. I was just talking about Kagi’s search lens for the Threadiverse in another comment that I saw. I wonder if Kagi actually indexes Mastodon as well? That’d provide for similar functionality.

          *investigates*

          No, it looks like they only do the Reddit-alike Threadiverse (lemmy, kbin, mbin, etc), for which they use the term “Fediverse Forums”.

          *investigates further*

          It does look like they index in real time, though, or at least quickly – they probably are one of the institutions out there with an instance slurping up everything out there. I was able to find your comment on that search lens.

          > That’s a lot of updates considering the amount of posts they get daily.

          Yeah, I’m sure that however the Twitter guys built it, they specifically designed it around permitting inexpensive index updates.
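
          I have no idea what Twitter actually runs, but as a rough sketch of why index updates can be cheap: with a plain in-memory inverted index, adding a new post only touches the posting sets for the words in that post, and nothing already indexed has to be rebuilt. The class name, method names, and sample posts below are all made up for illustration.

```python
from __future__ import annotations

from collections import defaultdict


class InvertedIndex:
    """Toy full-text index: maps each word to the set of post ids containing it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, post_id: int, text: str) -> None:
        # Incremental update: only the words of the new post are touched.
        for token in set(text.lower().split()):
            self.postings[token].add(post_id)

    def search(self, query: str) -> set[int]:
        # Return ids of posts that contain every word of the query.
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = InvertedIndex()
index.add(1, "flood warning downtown")
index.add(2, "downtown roads closed")
hits_before = index.search("downtown flood")

# A new post becomes searchable as soon as it is added.
index.add(3, "flood water rising downtown")
hits_after = index.search("downtown flood")

print(hits_before, hits_after)
```

          A real system would also need tokenization, ranking, deletion, and sharding across machines, none of which this sketch attempts.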