this post was submitted on 11 Jan 2024
33 points (92.3% liked)


Hi everyone,

ever since I switched to Arch about two months ago, most applications have been segfaulting multiple times a day. There doesn't seem to be any pattern to the crashes; sometimes they even happen while the system is idling (e.g. while reading a news article).

Things I've tried without any luck so far:

  • Run Firefox in safe mode without any extensions
  • Switch from the regular to the LTS kernel
  • Disable hardware acceleration in Firefox
  • Change RAM speed and timings
  • Run Memtest (completed without errors)
  • Replace entire RAM with a new certified kit
  • Use only a single RAM slot
  • Apply Ryzen fixes (iommu=soft, limit c-states)
  • Use only a single CPU core (maxcpus=1)
  • Downgrade Nvidia driver to 535xx
  • Use Nouveau instead of the nvidia driver
  • Use Openbox instead of KDE
  • Disable zswap and THP

Here's the full journalctl output from a day on which both Spotify and Firefox crashed at the end, a few seconds apart:

https://pastebin.com/BH0LMnD9

Some more info about my system:

  • Ryzen 5 3600X
  • MSI B450M PRO-VDH Max
  • 32GB RAM @ 3200MHz
  • Geforce RTX 2070 SUPER (using nvidia-dkms)
  • Plasma 5.27.10 on X11

I'm pretty sure that it's not hardware related, because I've booted up a Debian 12 live image where everything ran for several hours without a crash. But it seems to be Arch related, as I also booted up a fresh EndeavourOS live image (so basically Arch), where applications also randomly segfaulted. Any idea why everything works fine on Debian but not on Arch? Debian uses the 6.1 kernel, which I already tried, so that's not it.

Let me know if you need any more information that might help solve this issue. Thanks!

Edit [solved]: It looks like disabling PBO in the UEFI/BIOS did the trick. The strange thing is that after enabling it again, it still isn't crashing. Someone suspected that the motherboard's default/training settings were faulty, so I guess this was a very rare case. That's probably why it took so long to find a solution. Thanks everyone for helping me out!

all 35 comments
[–] gbin@lemmy.ca 7 points 10 months ago (1 children)

The crashes are inside browser engines (both Firefox and the Chromium embedded in Spotify). If you run a simple mprime stress test (mprime-bin from the AUR), does it crash too?

[–] cbarrick@lemmy.world 3 points 10 months ago

Yeah, this sounds somewhat like unstable hardware.

Definitely start with a stress test or memory test.

[–] avidamoeba@lemmy.ca 6 points 10 months ago* (last edited 10 months ago) (2 children)

Could be a defective library that's used by many apps. Glibc, etc. That said, if something like this is that broken, others should be complaining about it too.

[–] gbin@lemmy.ca 5 points 10 months ago

One crash was in libxul and the other in libcef, so I doubt this is one specific lib.

[–] 30021190@lemmy.cloud.aboutcher.co.uk 3 points 10 months ago (2 children)

Maybe a corrupt download/copy of a library... Try reinstalling, say, glibc?

[–] avidamoeba@lemmy.ca 1 points 10 months ago

This is a good idea, but they probably need to figure out which lib is shitting the bed first. There are too many libs to try otherwise.

[–] SpaceCadet@feddit.nl 1 points 10 months ago

"Maybe a corrupt download/copy of a library... Try reinstalling, say, glibc?"

Doesn't explain why it also crashes in an EndeavourOS live image...

[–] lemming741@lemmy.world 4 points 10 months ago (1 children)

I had a 3700x that was doing that sort of thing. It seemed mostly random, but moving big files would crash it pretty often. It ran memtest86 for 3 days no problem. I replaced part by part, and it ended up being the CPU. I'd bought it second hand so it may have been abused.

[–] NoisyFlake@lemm.ee 1 points 10 months ago (2 children)

But if it's a faulty CPU, wouldn't it also crash on Debian?

[–] avidamoeba@lemmy.ca 5 points 10 months ago (1 children)

Wild guess, there could be differences in compilation optimization that expose this hypothetical proc defect on Arch but not on Debian. Try a day or two of mprime as some others suggested.

[–] SpaceCadet@feddit.nl 2 points 10 months ago (1 children)

"Try a day or two of mprime as some others suggested."

That wouldn't necessarily reveal a faulty CPU or firmware. I used to have a 3600x that would sometimes crash on idle at low clocks but would run cinebench or geekbench all day and all night.

[–] avidamoeba@lemmy.ca 1 points 10 months ago

For sure. It would catch a subset of issues.

[–] lemming741@lemmy.world 1 points 10 months ago

I would think so, but the symptoms sound similar enough, and it's a very similar CPU model, so I thought I'd mention it.

[–] SpaceCadet@feddit.nl 4 points 10 months ago (1 children)

"I’m pretty sure that it’s not hardware related"

Random segfaulting is not something that "just happens" because of an OS misconfiguration. And if the same problem happens on Arch as well as on a clean EndeavourOS live image, that convinces me it is in fact hardware related somehow. As you have already replaced the RAM, my guess is a CPU or motherboard issue.

Zen2/B450 is a widely used and well supported configuration on Linux that you normally shouldn't have issues with, but Zen2 CPUs are rather notorious for having fragile memory controllers, and dodgy AGESA firmware releases sometimes cause issues on some CPUs. I used to have a 3600X myself that started crashing at idle around a particular firmware release of my motherboard, and it was fixed by a subsequent release.

BTW the fact that it doesn't happen on Debian doesn't necessarily mean that Arch is the culprit. It could just be that Debian is not triggering the fault because of different, perhaps more conservative, compiler optimizations.

As a last ditch effort, you could try resetting your entire UEFI (bios) settings to default, preferably by pulling the CMOS battery.

BTW, is it only GUI applications that are segfaulting? Or other programs as well? Do you have an old spare GPU you can test with?

[–] NoisyFlake@lemm.ee 2 points 10 months ago (1 children)

I already did a UEFI reset, that didn't help. As far as I can tell, it's only GUI applications, I haven't seen a segfault for something else so far. Unfortunately I don't have any other GPU right now.

It seems that a solution was found though (at least for now; it hasn't crashed for a few hours) here: https://lemm.ee/comment/8161085

[–] SpaceCadet@feddit.nl 3 points 10 months ago

Glad to hear that disabling PBO helped, but it does indicate that something may not be entirely healthy with your CPU (or with the way the motherboard is driving it, that also can't be excluded)

[–] vzq@lemmy.blahaj.zone 4 points 10 months ago (1 children)

Can you enable core dumps and get stack traces? From there you should be able to figure out which shared library is broken.

[–] NoisyFlake@lemm.ee 2 points 10 months ago

Uhm, isn't that what can be found at the end of the journalctl log I posted? Or are you talking about something different?
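For anyone who wants to see at a glance which library the kernel names for each crash, here is a minimal Python sketch that tallies kernel segfault lines per process and module. It assumes the usual kernel log shape (`name[pid]: segfault at ... ip ... sp ... error ... in lib.so[...]`); the function name `tally_segfaults` is made up for illustration, and lines that don't name a module are counted as "unknown".

```python
import re
from collections import Counter

# Assumed shape of a kernel segfault line (dmesg or journalctl -k), e.g.:
#   firefox[4321]: segfault at 0 ip 00007f... sp 00007ffd... error 4 in libxul.so[7f...+5e00000]
SEGFAULT_RE = re.compile(
    r"(?P<proc>[^\[\]:]+)\[(?P<pid>\d+)\]: segfault at [0-9a-f]+ "
    r"ip [0-9a-f]+ sp [0-9a-f]+ error [0-9a-f]+"
    r"(?: in (?P<module>[^\[\s]+)\[)?"
)


def tally_segfaults(lines):
    """Count segfaults per (process, module) pair found in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            proc = match.group("proc").strip()
            module = match.group("module") or "unknown"
            counts[(proc, module)] += 1
    return counts
```

Feeding it the lines of `journalctl -k` output should show whether the crashes cluster in one library or are spread across unrelated ones. A spread like libxul plus libcef points away from a single broken package.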

[–] Ludrol@szmer.info 2 points 10 months ago (1 children)

I would guess that this is an SSD issue rather than a CPU issue: you ran a live Debian image from USB and did not encounter any crashes.

[–] NoisyFlake@lemm.ee 2 points 10 months ago

But I also ran a live EndeavourOS from USB and the same crashes happened.

[–] vildis@lemmy.dbzer0.com 1 points 10 months ago

Could you try an older EndeavourOS image?

This sounds very much like a driver/firmware/hardware issue

[–] CameronDev@programming.dev 1 points 10 months ago (1 children)

Try increasing RAM voltage? Might make it more stable under load. I had a similar issue, clean memtest, but games would randomly crash. Increasing RAM voltage fixed it.

[–] NoisyFlake@lemm.ee 1 points 10 months ago (1 children)

What voltage should I try? It's currently at 1.35V, and I've read somewhere that this is the highest "safe" voltage.

[–] CameronDev@programming.dev 1 points 10 months ago* (last edited 10 months ago)

I jumped to 1.4V, which afaik is safe, but I can't guarantee anything. Going up slowly might be better, but stop at 1.4?

Corsair says 1.4 is safe: https://help.corsair.com/hc/en-us/articles/360052448851-Tips-on-safely-overclocking-memory

[–] drwho@beehaw.org 0 points 10 months ago (1 children)

Are you keeping an eye on system temperature?

[–] NoisyFlake@lemm.ee 0 points 10 months ago (1 children)

Yeah, temperatures are usually between 40-50 °C, so that should be fine.
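A quick way to log temperatures without a GUI tool is to read the kernel's hwmon sysfs files directly. The Python sketch below assumes the standard layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); the function name `read_hwmon_temps` is made up for illustration, and which sensors show up depends on the loaded drivers.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_celsius} for every readable hwmon temperature input."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # An optional tempN_label file gives the sensor a readable name.
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                # sysfs reports millidegrees Celsius.
                temps[f"{chip_name}/{label}"] = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors exist but cannot be read
    return temps
```

Calling it in a loop and printing the result every few seconds gives a rough picture of what the temperatures were doing right before a crash.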

[–] drwho@beehaw.org 0 points 10 months ago (1 children)

Yeah, that should be fine.

Anything in the kernel message buffer? dmesg -T | less

[–] NoisyFlake@lemm.ee 1 points 10 months ago (1 children)

I'm not sure, here's the entire dmesg output: https://pastebin.com/MZfhB0xK

[–] drwho@beehaw.org 1 points 9 months ago

I'm not seeing anything relevant to lockups or crashes in there. Pretty boring logs.
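Scanning a long dmesg dump by eye is easy to get wrong, so a small filter can help. This is a minimal Python sketch; the keyword list is an illustrative assumption rather than an exhaustive catalogue of hardware-error messages, and `suspicious_lines` is a made-up name.

```python
# Substrings that often accompany hardware trouble in kernel logs.
# This list is an assumption for illustration, not an exhaustive catalogue.
SUSPICIOUS = ("hardware error", "machine check", "mce:", "segfault", "general protection")


def suspicious_lines(lines, keywords=SUSPICIOUS):
    """Return the log lines that contain any keyword (case-insensitive)."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

An empty result doesn't prove the hardware is healthy. It only means the kernel didn't log anything matching these keywords.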