Hey, first off, I really appreciate what you are doing here with this Lemmy instance! This was the first one I joined during the Great Rexxit of 2023.

However, about a week later, I created another account on a larger server (.world) and have been using that as my main, because I could not find a lot of communities on this instance at all (!mlemapp@lemmy.ml was the first one where I had this issue, but the majority I tried to subscribe to at the time were not showing up at all, while lemmy.world could see everything except Beehaw). With the update to 0.18, this got a lot better: now I can just keep retrying until it finds the community.

Of course, .world got super overloaded at the beginning of July with everybody finally leaving Reddit, so I subscribed to all the communities I had there over here on the Outpost, but over a week later, some communities still are not showing all posts. The most egregious example I have is !antiquememesroadshow@lemmy.world, which still isn’t showing any posts at all on this instance.

Now that things have been going better with .world, I’ve been directly comparing the two and I’m still missing a lot of posts here that I can see on .world (whether the community is on .world or elsewhere).

Is this a Lemmy glitch? Can anything be done about this? Thanks in advance!

  • russjr08@outpost.zeuslink.netM · 1 year ago

    Ninja edit before I’ve even submitted this: I think I’ve identified something that might be the source of the problem and am working on it now, but I’ll leave the rest of my original comment since it still covers some useful points about ActivityPub and where/why these issues happen in the first place.

    Hey there! I appreciate you reporting this issue! This is going to be a long comment, so the very quick TL;DR is that I do see what you’re seeing as well, but I’m having a hard time tracking down the source of the problem to even confirm whether it’s an issue on our end - if you have the time, though, I’d recommend reading this whole comment to see why that is 😅

    This looks like some federation issues between us and lemmy.world; I suspect it could be due to how large lemmy.world is and how many instances are connected to them. I do see us on their linked instances list (which can be found on any instance by going to /instances), and conversely they’re on ours as well.

    What is even stranger is that I definitely see posts and comments coming in from lemmy.world too. For example, at !technology@lemmy.world there is this post which went up three hours ago; at the time of writing we show 76 comments on that post (it was 75 until I refreshed just now), and on lemmy.world there are 79 comments - that is almost perfectly lined up, and when I sort the comments by new there are comments from 3, 6, and 13 minutes ago.

    And yet, for !antiquememesroadshow@lemmy.world there is nothing there. Out of curiosity, when you visit that community on our side does it show you as “Joined” or “Subscribe Pending”?

    Ideally it should be “Joined”, because that means lemmy.world has acknowledged that you (or rather, we) want updates from that community - whereas the pending status means that our instance has sent out the request but they have not acknowledged it yet (or possibly they did send the signal acknowledging the subscription and we didn’t receive it - this video called “The Two Generals’ Problem” describes that situation).

    For that to make sense I’ll give a quick breakdown of how ActivityPub works as I understand it (just in case anyone reading doesn’t already know, and if I have got it wrong please do correct me!):

    Basically, every instance has an inbox where it receives updates. When you subscribe to a community on another instance, our instance sends a message over to that instance’s /inbox saying “Hey, we want updates (posts/comments/upvotes/downvotes/etc) from this community!”, and that instance (lemmy.world as an example here, which I’ll start abbreviating as LW for the sake of brevity) replies “Okay, I’ll tell you about anything that occurs in this community” - that acknowledgement coming back is what makes the subscription status change to “Joined”.

    Then, whenever any action that needs to be federated happens in that community, the instance hosting the community sends a message directly to the inbox of every instance that has at least one person subscribed to it. In this example, our instance would then take that message/action and apply it to what is essentially a local mirror of the community over here.
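
    To make that a bit more concrete, here’s a rough sketch of what those messages look like on the wire. The JSON is trimmed down and the IDs/URLs are just examples, so don’t read this as the exact payloads Lemmy sends:

            # Rough shapes of the ActivityPub messages involved (illustrative only).

            # 1. Our instance asks to follow a community (delivered to LW's inbox):
            follow = {
                "@context": "https://www.w3.org/ns/activitystreams",
                "id": "https://outpost.zeuslink.net/activities/follow/abc123",  # example id
                "type": "Follow",
                "actor": "https://outpost.zeuslink.net/u/some_user",            # who is subscribing
                "object": "https://lemmy.world/c/technology",                   # the community (a Group)
            }

            # 2. LW acknowledges the subscription (delivered back to our inbox) -
            #    this is what flips the status from "Subscribe Pending" to "Joined":
            accept = {
                "@context": "https://www.w3.org/ns/activitystreams",
                "type": "Accept",
                "actor": "https://lemmy.world/c/technology",
                "object": follow,  # echoes the Follow it is accepting
            }

            # 3. From then on, the community announces activity (posts, comments,
            #    votes, ...) to every instance with at least one subscriber:
            announce = {
                "@context": "https://www.w3.org/ns/activitystreams",
                "type": "Announce",
                "actor": "https://lemmy.world/c/technology",
                "object": {"type": "Page", "name": "An example post title"},
            }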

    This method of publishing-subscribing (often shortened to “PubSub”) works really well, up to a point in terms of scaling. Unfortunately, what this means is that any of these situations can cause a post/comment/upvote/etc to not replicate between us <-> LW (or any other instance):

    • The remote instance has a network error, which causes the update to not actually be sent out (whether to all federated instances, or just a single one)

    • The receiving instance has a network error, which causes the update to not be received on our side

    • The remote instance encounters a server-side issue which causes the update to never be sent out

    • The receiving instance encounters a server-side issue which causes the update to not be received on our side

    … and the list goes on, but the main point is that any error that occurs with the Lemmy software itself (or something in between) has a chance of occurring on either side. And unfortunately, if it occurs on the remote instance’s side (like LW), then there’s not really much I can do on my end to correct the problem at a technical/administration level. I do believe Lemmy itself has a retry queue for update messages to some extent, but I don’t know how resilient it is (and it is also in-memory only, so if Lemmy gets restarted, all of those re-queued messages get lost).
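
    To illustrate why those retries only help so much, the sending side boils down to something like the sketch below. This is not Lemmy’s actual code (its real delivery queue is more sophisticated, and real deliveries are also HTTP-signed), but the key point carries over: the pending retries live in memory.

            import time
            import requests

            def deliver(activity: dict, inbox_url: str, max_attempts: int = 5) -> bool:
                """Try to POST an activity to a remote inbox, retrying with backoff."""
                delay = 5  # seconds between attempts, doubled each time
                for _ in range(max_attempts):
                    try:
                        resp = requests.post(
                            inbox_url,
                            json=activity,
                            headers={"Content-Type": "application/activity+json"},
                            timeout=10,
                        )
                        if resp.ok:
                            return True  # delivered successfully
                    except requests.RequestException:
                        pass  # network/server error on either side - fall through and retry
                    time.sleep(delay)
                    delay *= 2
                # If we give up here (or the process restarts and this in-memory state is
                # lost), the remote instance simply never hears about the activity.
                return False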

    Doing some further testing, I think the issue is with new community subscriptions to LW: I went ahead and tried to subscribe to !googlepixel@lemmy.world and it seems to be stuck at “Subscribe Pending” for me. I even did a restart of the Lemmy stack on our side just to be sure there was nothing funky going on, and tried re-subscribing, which still hasn’t cleared it up.

    As for lemmy.ml, that instance has definitely been having some issues - it was completely down earlier as far as I saw, and I’ve been having trouble getting subscriptions to go through there for a while now. It’ll clear up randomly, and then all of a sudden stop again (for example, I can’t for the life of me get a subscription to !lemmy_admin@lemmy.ml to go through, which would have been nice to have when the XSS exploit started going around - thankfully I still got news of that pretty quickly since I’m in all of the Lemmy Administration Matrix channels).

    I know that federation in general is definitely working. A few weeks ago I built a tool to help instance admins look at federation stats in real time by exporting data from Lemmy’s Postgres database over to InfluxDB, which adds a time dimension to the data and allows it to be used with something like Grafana. You can see these stats (along with some general stats about our instance) over at this public dashboard. I’ve had everything set up for a bit now, but the dashboard was connected to my personal domain - and while that domain isn’t exactly a secret, I still wanted a Grafana instance dedicated to The Outpost, which I only got set up today (and was just about to make a post about it when I saw yours!). All of that is to say that I’ve been monitoring the federation status to make sure nothing is borked on our side.
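
    Conceptually the exporter is nothing fancy - just a loop that samples a few counts from Lemmy’s database and writes them to InfluxDB as points for Grafana to graph. Here’s a stripped-down sketch of the idea (the table names, measurement name, and credentials are placeholders/assumptions, not necessarily what my tool actually uses):

            import time

            import psycopg2
            from influxdb_client import InfluxDBClient, Point
            from influxdb_client.client.write_api import SYNCHRONOUS

            PG_DSN = "dbname=lemmy user=lemmy host=localhost"  # placeholder credentials
            INFLUX_URL, INFLUX_TOKEN, INFLUX_ORG, BUCKET = "http://localhost:8086", "token", "org", "lemmy-stats"

            # Counts to sample - the table names are assumptions about Lemmy's schema.
            QUERIES = {
                "posts": "SELECT count(*) FROM post",
                "comments": "SELECT count(*) FROM comment",
                "known_communities": "SELECT count(*) FROM community",
            }

            def sample_once(cur, write_api):
                point = Point("lemmy_stats")
                for field, sql in QUERIES.items():
                    cur.execute(sql)
                    point = point.field(field, cur.fetchone()[0])
                write_api.write(bucket=BUCKET, record=point)

            if __name__ == "__main__":
                pg = psycopg2.connect(PG_DSN)
                influx = InfluxDBClient(url=INFLUX_URL, token=INFLUX_TOKEN, org=INFLUX_ORG)
                write_api = influx.write_api(write_options=SYNCHRONOUS)
                with pg.cursor() as cur:
                    while True:  # sample once a minute; Grafana handles the graphing
                        sample_once(cur, write_api)
                        pg.commit()  # end the read transaction so it doesn't stay open
                        time.sleep(60)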

    Additionally, there doesn’t seem to be a firewall/routing issue in terms of us connecting to both LW and lemmy.ml, as I can query the API of both instances directly from the VM that is running our instance. When setting up The Outpost, I even made sure to explicitly opt out of Cloudflare’s proxy (which may have to change one day if our instance ever becomes big enough to be a target for DDoS attacks), just because I felt Cloudflare’s protection might interfere with federation.
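
    For reference, a check like that doesn’t need anything exotic - just hitting each instance’s standard API from the VM, roughly along these lines (this uses Lemmy’s public GetSite endpoint):

            import requests

            # Quick reachability check: an HTTP 200 with a JSON content type means
            # basic connectivity/routing from our VM to the remote instance is fine.
            for instance in ("lemmy.world", "lemmy.ml"):
                resp = requests.get(f"https://{instance}/api/v3/site", timeout=10)
                print(instance, resp.status_code, resp.headers.get("content-type"))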

    I’ll try digging through our logs to see if I can find any cases of activity information from either instance being blocked/denied for any reason (we certainly haven’t defederated from them - there are only two instances at the moment that we are defederated from, because those two have caused a ton of trouble for others in the past and the present). If I can’t find anything there, I’ll see about reaching out to either LW directly or to other instance admins to see if they’re noticing the same problems. I also haven’t completely ruled out the problem being on our end, but given that I do see comments and posts from both instances (even !mlemapp@lemmy.ml seems to be mostly in sync at the time of writing from what I can see), that makes the problem harder to even find - if it were an all-or-nothing issue, it’d be super easy to track down.

    Whew, sorry about how long that comment was, but I wanted to be sure that I was giving all of the details I know about the situation as I like to be transparent when possible (and I myself am usually a pretty open-book person). I know some people tend to not really care about the “magic behind the curtains” so to speak, but I’ve always appreciated seeing technical breakdowns and postmortems from companies since I personally am a big fan of said magic!

    I also apologize for the delay in responding to this - I had a ton of dental work done on Thursday, which involved three root canals, four fillings, and one tooth extraction, and the pain medication that I’ve taken for it has caused me to rest… a lot! Additionally, I should find a way to make sure I get push notifications about posts to this community as well…

    • russjr08@outpost.zeuslink.netM · 1 year ago

      Ah ha! I think I know (mostly) exactly what happened here. @danielton@outpost.zeuslink.net could you try to find a community that you’re subscribed to that is stuck on “Subscribe Pending” (this link will take you directly to your subscriptions), click it to unsubscribe, and then click it again to retry the subscription? If you refresh, it should go to “Joined” within 30 seconds or so, which will confirm that everything is working properly. If there are zero posts in that community, it should also try to backfill about 10~20 posts - I don’t believe Lemmy backfills the comments though (and I might be wrong about the backfilling, or it may just take a bit of time to happen).

      If anyone wants the technical breakdown/details, it continues from this point onward:

      When Lemmy 0.18.0 came out, one of the targeted fixes was for federation. I believe multiple fixes were made in Lemmy’s codebase regarding message re-queue times, but one of the other recommended updates was to the nginx config (nginx being the web server that actually receives connections from the internet and proxies requests between the Lemmy Docker containers and the general internet).

      When I wrote my initial comment here, there was this line:

      Additionally, there doesn’t seem to be a firewall/routing issue in terms of us connecting to both LW and lemmy.ml, as I can query the API for both instances directly from the VM that is running our instance.

      Right before I submitted that comment, I figured it might be a good idea to test querying our own API (and from outside the Lemmy VM, just to be extra cautious) to see what happens, and the results were very telling:
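
      For anyone who wants to run the same kind of check, it boils down to requesting a community from our instance while asking for ActivityPub JSON instead of HTML - something along these lines (the community name here is just an example):

              import requests

              # Ask our own instance for a community as ActivityPub JSON rather than HTML.
              # A healthy instance answers with an 'application/activity+json' response from
              # the Lemmy backend; a broken route shows up as an error or an HTML page instead.
              resp = requests.get(
                  "https://outpost.zeuslink.net/c/meta",            # community URL; name is just an example
                  headers={"Accept": "application/activity+json"},  # the header the nginx config keys on
                  timeout=10,
              )
              print(resp.status_code, resp.headers.get("content-type"))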

      I took a look at our Nginx config and compared it to the recommended config, which looked correct, but running the above API query gave me a hint at where the problem was, which is line 12 in that paste:

                      "~^(?:GET|HEAD):.*?application\/(?:activity|ld)\+json" "http://lemmy";
      

      Specifically, this line tells Nginx where to send requests that match a regex pattern for an Accept: application/activity+json header on an HTTP GET or HEAD request. At the end of that line is the destination, in this case http://lemmy - I know that looks incorrect at first glance because http://lemmy isn’t a public address, but it is not supposed to be a public address; instead, it’s supposed to match an “upstream” configured in Nginx, and our config has the following upstreams defined:

              upstream lemmy-backend {
                      server lemmy:8536;
              }
              upstream lemmy-frontend {
                      server lemmy-ui:1234;
              }
      

      In other words, this tells Nginx that anything in the config that mentions http://lemmy-backend should actually go to http://lemmy:8536, and conversely requests to http://lemmy-frontend should go to http://lemmy-ui:1234. lemmy and lemmy-ui refer to the Docker containers for Lemmy’s backend and frontend components, respectively.

      When I was originally putting together our instance, I explicitly chose to make sure the upstream names didn’t match the container names, because in the past I’ve had issues with exactly that. Unfortunately, that is not how the recommended config is structured - it matches the upstream names to the container names - which means that when I updated our Nginx config (because everything between line #4 and line #22 was new in the 0.18.0 update), I forgot to swap out that upstream name. It was specifically that one line, too, since line #6 and line #18 refer to the right upstream names (lemmy-frontend and lemmy-backend).

      Fixing that line (pointing it at http://lemmy-backend instead of http://lemmy) and restarting the server now returns the expected response:

      Now, my knowledge of Lemmy’s internals is something I’m still trying to correlate with how ActivityPub works, but I suspect what happened is this: any requests we sent out to remote instances/communities were indeed received, but when the remote instance tried to query us back to make sure our instance was also “speaking ActivityPub” (the same language, so to speak), it hit that broken endpoint, never confirmed the subscription, and so we never started receiving activity updates. For communities that were subscribed to before this update, our subscriptions were already “confirmed”, so we kept receiving updates for those.

      That of course led to things appearing to work as long as you weren’t actively subscribing to new communities (and from my point of view, looking at the dashboard linked in my previous comment, I could see that federation was actively occurring) - so no alarm bells went off on my side, and I couldn’t investigate until you notified me. I only had some suspicions that the larger instances were sometimes being problematic (both LW and lemmy.ml have had downtime, between LW getting DDoS’d yesterday and lemmy.ml having general growing pains as the flagship instance) and thought it was on their end.

      I sincerely apologize for that. I try to be as cautious as I can when performing updates to make sure that nothing breaks, but when something breaks and isn’t super obvious to spot, it becomes difficult for me to fix what I don’t know about. Our userbase isn’t very large, which isn’t necessarily an issue to me (as long as at least one person is getting some use out of The Outpost, I’m happy), but it does mean that it’s harder to detect problems at the rate the larger instances would, because they have a larger audience reporting issues. Additionally, the “Known Communities” metric on our dashboard was still going up despite the fact that the 0.18.0 update occurred about 20 days ago - I’m not sure how to explain that one…

      Ideally I’d love to come up with a series of tests I can run after each update to make sure that nothing is broken. So far the Lemmy stats dashboard has done a good job of letting me see potential issues (for example, if the “Comments by Local Users” metric shot up insanely high over, say, a 30-minute period, that would tell me that someone is possibly spamming somewhere, allowing me to find that user and kindly tell them to knock it off), but it definitely failed me on this occasion - at the very least, I’ll make sure I perform more federation tests so this doesn’t break again. I’ll even look into making some sort of automated monitoring against that endpoint so that I get an instant notification the moment it stops working.
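
      For the monitoring piece, even something simple run from cron would have caught this particular failure. A minimal sketch (the URL and the alerting hook are placeholders):

              import sys
              import requests

              CHECK_URL = "https://outpost.zeuslink.net/c/meta"  # any local community works; name is a placeholder

              def federation_endpoint_ok() -> bool:
                  """Return True if the instance answers ActivityPub requests correctly."""
                  try:
                      resp = requests.get(
                          CHECK_URL,
                          headers={"Accept": "application/activity+json"},
                          timeout=10,
                      )
                  except requests.RequestException:
                      return False
                  return resp.ok and "json" in resp.headers.get("content-type", "")

              if __name__ == "__main__":
                  if not federation_endpoint_ok():
                      # Hook up whatever notification works best here (ntfy, Matrix, email, ...).
                      sys.exit("ActivityPub endpoint check FAILED")  # non-zero exit so cron flags it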

        • russjr08@outpost.zeuslink.netM · 1 year ago

          It’s all good - I apologize about the issue occurring in the first place!

          Now if only I could figure out why Lemmy seems to be “freaking out” every so often, with response times spiking insanely high for a minute or two until it recovers… It doesn’t seem to be a resource starvation issue as far as I can tell, but that’s what I’m trying to tackle next!