Over the weekend we had a large intermittent outage, followed up by unplanned maintenance that I had put off for way too long.
Lemmy runs with several different services.
- lemmy-ui (the reactesque frontend)
- lemmy (the rust backend)
- postgres (the data store for operations, comments, posts, etc)
- pictrs (the image data store)
The outage concerns the last one. We always knew we'd eventually need to migrate to an object-based store, but Lemmy defaults to file-based picture storage and that's what we stuck with up until now. The stored images eventually filled up the VPS that programming.dev is running on, causing it to seize up, and resulted in the outage over the weekend.
Saturday night I spent several hours testing out the object migration on the beta.programming.dev site in order to validate that it worked. During this time I struggled with some very obtuse ansible errors that I hadn't encountered before and so I was not able to start the migration that night. I delayed until the next morning (thank goodness).
I began work Sunday morning at 10:00 America/Denver time. Initially the migration started off quite well, but it was moving incredibly slowly. Looking back on it now, the migration would have taken over 144 hours if I had left it to do its thing. I let this run for about an hour before messaging the pictrs dev to understand why logs weren't showing up for the migration (even though objects were showing up in the store). Apparently lemmy-ansible is set to use 0.4.0 of pictrs, which not only is quite old, but also doesn't have the ability to run migrations concurrently. That was the issue. I asked the dev if it was possible to stop a migration partway through, upgrade, and continue. They told me what changes I'd need to make, I made them, did the upgrade, and restarted the migration. It immediately failed. This was the start of my issues.
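For anyone wondering how a figure like that can be estimated while a migration is still running, here is a minimal sketch of a linear extrapolation from progress so far. The function name and the numbers in the example are made up for illustration; they are not the real object counts from this migration, and it assumes the migration rate stays roughly constant.

```python
def estimate_remaining_hours(done: int, total: int, elapsed_hours: float) -> float:
    """Linearly extrapolate the time left from the progress made so far.

    done:          objects already migrated (hypothetical counter)
    total:         objects that need to be migrated in total
    elapsed_hours: how long the migration has been running
    """
    if done <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress and elapsed time to extrapolate")
    rate_per_hour = done / elapsed_hours
    remaining = max(total - done, 0)
    return remaining / rate_per_hour


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 of 10,000 objects after half an hour.
    hours = estimate_remaining_hours(done=1_000, total=10_000, elapsed_hours=0.5)
    print(f"roughly {hours:.1f} hours remaining")
```

Checking an estimate like this early on would flag a migration that is going to take days rather than hours.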
The server was now too full of data to do anything, including running `apt update` or `apt install` to install tools to assist me. I was able to attach more block storage, but I'm not enough of a Linux guru to figure out how to mount it where the current pictrs filesystem would be able to take advantage of it. I had to resort to copying the entire pictrs filesystem to a fresh ~500 GB mount, fixing permissions, and then rerunning the migration from there. By the time I got to this point, it was about 12:30 PM. The migration from then on took several hours.
After the migration completed, I needed to deploy the new stack with the correct settings. The ansible script needed to run `apt`, though, and, well, that wouldn't work while the server was still full. At this point I was not confident in the migration, and I also hadn't realized that you could do the migration while the site was running (big oversight from me). I therefore wanted to keep the entire pictrs file store until I had proved the object store was working. I created another block storage volume, copied the entire pictrs directory over to it again (another 20 minutes or so), and then deleted the original directory. I was now able to run the ansible script and deploy the new settings for pictrs, confident that I had a backup available in case something went wrong (this is not the main backup method; the server is backed up externally as well, but I didn't want to have to resort to those backups during the migration).
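Before deleting an original directory after a copy like this, it can be reassuring to check that the copy is complete. The sketch below is one way to do that, not what I actually ran: it compares relative paths and file sizes only (no checksums), and the function names are hypothetical.

```python
import os


def tree_summary(root: str) -> dict[str, int]:
    """Map each file's path (relative to root) to its size in bytes."""
    summary: dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            summary[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return summary


def copy_is_complete(original: str, backup: str) -> bool:
    """True when every file under original exists under backup with the same size."""
    source_files = tree_summary(original)
    backup_files = tree_summary(backup)
    return all(backup_files.get(path) == size for path, size in source_files.items())
```

A size comparison is fast but will not catch corrupted contents; a checksum pass is more thorough if there is time for it.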
That completed the migration, some 5 hours after it originally started.
Several things exacerbated the issue and made it take several hours longer than I wanted:
- I let it go so long before doing the migration to object storage that the server was too full to even perform an `apt update`. This resulted in me not being able to install tools I needed, along with a host of other issues as mentioned.
- pict-rs was at a very suboptimal version. If it had just been two minor versions newer, it would have migrated perfectly fine, in a few hours.
- My limited knowledge of ansible led me on wild goose chases several times.
Things I would change if I had to do it again:
- Dig in a bit deeper on the concurrency flag in the pictrs docs. It was not present in the original guide I followed (from a lemmy post on another instance), and thus I didn't realize that it wouldn't run with concurrency at all.
- Don't wait until the server is full
- Migrate while the server is running. That would have been dumb in this case, since the server wouldn't stay up anyway, and could have caused other issues. But there was no reason to take the server down if it had been stable, and other instances have done so with no problems.
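As a sketch of the "don't wait until the server is full" point, a small check like the following could be run from cron to flag a filling disk early. The 80% threshold and the function names are assumptions for illustration, not something we currently run.

```python
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Return how full the filesystem containing path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_attention(path: str = "/", threshold_percent: float = 80.0) -> bool:
    """True when usage is at or above the (assumed) warning threshold."""
    return disk_percent_used(path) >= threshold_percent


if __name__ == "__main__":
    if needs_attention("/"):
        print(f"Disk is {disk_percent_used('/'):.1f}% full; plan the migration now")
```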
We can’t thank y’all enough for putting in the energy, money, and time to keep this thing running. Amazed that there haven’t been more outages given the circumstances and how unstable Lemmy still is at this point in its development. Thanks a ton!
Haha there’s like 10 outages a day 😂 most people just don’t notice.
Not sure outage is the right word though, it’s that our status.programming.dev provider (pulsetic) reports an outage. Hyperping also reports several a day. It’s usually just a 500 so something is happening but I’m not sure what.