Random application segfaults on Arch

NoisyFlake · edit-2 5 months ago

zelifcam@lemmy.world · edit-2 6 months ago

deleted by creator

NoisyFlake · 6 months ago

Hm, I’ve had this problem since my initial setup about 2-3 months ago. If something were wrong with the software in the repos, I think it would’ve been fixed by now, and I wouldn’t be the only one having this problem, right?

But of course, if you want I can give the testing repos a try :)

zelifcam@lemmy.world · 6 months ago

deleted by creator

gbin@lemmy.ca · 6 months ago

The crashes are in the middle of browsers (both Firefox and the Chromium embedded in Spotify). If you try a simple mprime stress test (mprime-bin from the AUR), does it crash too?

cbarrick@lemmy.world · 6 months ago

Yeah, this sounds somewhat like unstable hardware.

Definitely start with a stress test or memory test.

Michael Murphy (S76)@lemmy.world · edit-2 6 months ago

Make sure you have the latest firmware for your motherboard. This sounds like unstable voltages for memory, or an overly aggressive PBO curve. Did you try disabling the XMP profile on the RAM, disabling PBO, and upping the voltages (within safe limits) of the SOC, DDR, and VDDP? You might find some useful info here[0] or here[1] if you intend to run your memory at 3200 MHz.

NoisyFlake · 6 months ago

Motherboard firmware is up-to-date, and I’ve already tried disabling XMP. I’ll give disabling PBO a try, thanks!

I don’t necessarily have to run at 3200MHz, if it means that the system is finally stable. But since it’s already crashing at the default 2133MHz, I suppose there’s no use in playing with the voltages?

Michael Murphy (S76)@lemmy.world · edit-2 6 months ago

It’s difficult to say with certainty what the issue is without trial and error. I would expect that the motherboard’s manufacturer would make sure that their board can successfully pass all tests with the standard JEDEC spec for DDR4 (2133 MHz).

Since you say that you’ve tried different RAM kits, another possibility is the cleanliness of power from the power supply. Perhaps there is intermittent voltage droop, and you need to experiment with the Load Line Calibration settings to adjust for vdroop between idle and load. Disabling frequency boosting and manually setting the CPU frequency could help check if it’s related to that. PBO curves might be undervolting too much while idle.

NoisyFlake · 6 months ago

I’m a bit speechless right now. I disabled PBO and haven’t had a single crash since; everything’s been running fine for hours. Just to make sure that this really was the issue, I enabled PBO again, but I still haven’t experienced any crashes in the last few hours. I have no idea how simply disabling and then re-enabling the feature fixed my issue, but for now it seems like all is well.

Do you have any explanation for this weird behavior?

Anyway, thank you very much for your suggestion, looks like this actually did the trick!

Michael Murphy (S76)@lemmy.world · edit-2 6 months ago

Sounds like voltage droop and/or a motherboard with faulty automatic “training” settings. I don’t recall if the Ryzen 3000 had custom PBO curves, but tweaking this can fix it. Alternatively, slightly upping LLC and the SOC and CPU voltages could help. That said, I’ve had my most stable overclock by disabling PBO entirely and using a manual CPU multiplier.

Tempy@lemmy.temporus.me · 5 months ago

Try running a memtest. If it’s not voltages, it could be a faulty RAM stick. I’ve had a case where data got written but what was read back was garbage. It corrupted some pretty important files on my system when I ran an update and it used that faulty section for its buffer.

Avid Amoeba@lemmy.ca · edit-2 6 months ago

Could be a defective library that’s used by many apps. Glibc, etc. That said, if something like this is that broken, others should be complaining about it too.

gbin@lemmy.ca · 6 months ago

One crash was in libxul and the other in libcef, so I doubt this is a specific lib.

30021190@lemmy.cloud.aboutcher.co.uk · 6 months ago

Maybe a corrupt download/copy of a library… Try reinstalling, say, glibc?

Avid Amoeba@lemmy.ca · 6 months ago

This is a good idea, but they probably need to figure out which lib is shitting the bed first. There are too many libs to try otherwise.

DefederateLemmyMl@feddit.nl · 6 months ago

> Maybe a corrupt download/copy of a library… Try reinstalling, say, glibc?

Doesn’t explain why it also crashes in an EndeavourOS live image…

lemming741@lemmy.world · 6 months ago

I had a 3700x that was doing that sort of thing. It seemed mostly random, but moving big files would crash it pretty often. It ran memtest86 for 3 days no problem. I replaced part by part, and it ended up being the CPU. I’d bought it second hand so it may have been abused.

NoisyFlake · 6 months ago

But if it’s a faulty CPU, wouldn’t it also crash on Debian?

Avid Amoeba@lemmy.ca · 6 months ago

Wild guess, there could be differences in compilation optimization that expose this hypothetical proc defect on Arch but not on Debian. Try a day or two of mprime as some others suggested.

DefederateLemmyMl@feddit.nl · 6 months ago

> Try a day or two of mprime as some others suggested.

That wouldn’t necessarily reveal a faulty CPU or firmware. I used to have a 3600x that would sometimes crash on idle at low clocks but would run cinebench or geekbench all day and all night.

Avid Amoeba@lemmy.ca · 6 months ago

For sure. It would catch a subset of issues.

Possibly linux@lemmy.zip · 6 months ago

It probably would under the right circumstances. It’s more of a coin flip.

lemming741@lemmy.world · 6 months ago

I would think so, but the symptoms sound similar enough, and the CPU model is very similar, so I thought I’d mention it.

DefederateLemmyMl@feddit.nl · 6 months ago

> I’m pretty sure that it’s not hardware related

Random segfaulting is not something that “just happens” because of an OS misconfiguration. And if the same problem happens on Arch as well as on a clean EndeavourOS live image, that convinces me that it is in fact hardware related somehow. As you have already replaced the RAM, my guess is a CPU or motherboard issue.

Zen2/B450 is a widely used and well supported configuration on Linux that you normally shouldn’t have issues with, but Zen2 CPUs are rather notorious for having fragile memory controllers, and sometimes dodgy AGESA firmware releases that can cause issues on some CPUs. I used to have a 3600X myself that started crashing at idle around a particular firmware release of my motherboard, and it was fixed by a subsequent release.

BTW the fact that it doesn’t happen on Debian doesn’t necessarily mean that Arch is the culprit. It could just be that Debian is not triggering the fault because of different, perhaps more conservative, compiler optimizations.

As a last-ditch effort, you could try resetting your entire UEFI (BIOS) settings to default, preferably by pulling the CMOS battery.

BTW, is it only GUI applications that are segfaulting? Or other programs as well? Do you have an old spare GPU you can test with?

NoisyFlake · 6 months ago

I already did a UEFI reset, that didn’t help. As far as I can tell, it’s only GUI applications, I haven’t seen a segfault for something else so far. Unfortunately I don’t have any other GPU right now.

It seems that a solution was found though (at least for now; it hasn’t crashed for a few hours) here: https://lemm.ee/comment/8161085

DefederateLemmyMl@feddit.nl · 6 months ago

Glad to hear that disabling PBO helped, but it does indicate that something may not be entirely healthy with your CPU (or with the way the motherboard is driving it, that also can’t be excluded)

vzq@lemmy.blahaj.zone · 6 months ago

Can you enable core dumps and get stack traces? From there you should be able to figure out which shared library is broken.
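On Arch, core dumps are normally captured by systemd-coredump, so they can be inspected with coredumpctl. A minimal sketch, assuming systemd-coredump is active on the system; the function name `show_latest_crash` is made up for illustration:

```shell
# Show recent crashes recorded by systemd-coredump, if the tool is available.
# Assumption: a systemd system where systemd-coredump handles core dumps.
show_latest_crash() {
    if ! command -v coredumpctl >/dev/null 2>&1; then
        echo "coredumpctl not found; systemd-coredump may not be in use" >&2
        return 1
    fi
    # Last few recorded crashes (process, PID, signal, executable).
    coredumpctl --no-pager list | tail -n 5
    # Details for the newest dump; includes a stack trace when one could be generated.
    coredumpctl --no-pager -1 info
}
```

If gdb is installed, `coredumpctl debug` should open the newest dump in the debugger, where `bt` prints the backtrace.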

NoisyFlake · 6 months ago

Uhm, isn’t that what can be found at the end of the journalctl log I posted? Or are you talking about something different?

The Doctor@beehaw.org · 6 months ago

Are you keeping an eye on system temperature?

NoisyFlake · 6 months ago

Yeah, temperatures are usually between 40-50 °C, so that should be fine.

The Doctor@beehaw.org · 6 months ago

Yeah, that should be fine.

Anything in the kernel message buffer? `dmesg -T | less`
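To pull just the segfault entries out of that output and see which binary or library each one landed in, something like the following could help. This is a rough sketch: it assumes the usual kernel message format (`... segfault at ... in name[address+size]`), and the function name `summarize_segfaults` is hypothetical.

```shell
# Count kernel "segfault" lines per binary or library.
# Reads log text on stdin, so it works with dmesg or journalctl output alike.
# Lines that do not match the assumed format are silently skipped.
summarize_segfaults() {
    sed -nE 's/.*segfault at .* in ([^[ ]+)\[.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Typical use (dmesg may need root if kernel.dmesg_restrict is set):
#   dmesg -T | summarize_segfaults
#   journalctl -k -b | summarize_segfaults
```

If the counts are spread across unrelated libraries rather than concentrated in one, that points away from a single broken library.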

NoisyFlake · 6 months ago

I’m not sure, here’s the entire dmesg output: https://pastebin.com/MZfhB0xK

The Doctor@beehaw.org · 5 months ago

I’m not seeing anything relevant to lockups or crashes in there. Pretty boring logs.

Ludrol@szmer.info · 6 months ago

I would guess that this is a ~~CPU~~ SSD issue: you ran a live Debian image from a USB drive and did not encounter any crashes.

NoisyFlake · 6 months ago

But I also ran a live EndeavourOS from USB and the same crashes happened.

Possibly linux@lemmy.zip · edit-2 6 months ago

Do you have an old computer to test the software on? You could pull out the drive and put it in a known-good system to test. I would also use a live system for a while to see if it has problems.

This sounds like a hardware problem. Maybe the RAM controller on your CPU is faulty? Try reseating the CPU and check for bent pins.

If it’s a software problem, you could also go for the nuclear option and start over. It’s a pain, but it might be worth it. I don’t know about Arch, but with Debian you can reinstall all packages. Doing that should wipe out any corruption.
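For what it’s worth, Arch has rough equivalents: `pacman -Qkk` checks installed files against the package database, and the Arch wiki gives a one-liner for reinstalling everything from the repos. A sketch, assuming English pacman output; the helper name `altered_packages` is invented for illustration:

```shell
# Print "package count" for every package that reports altered files.
# Reads `pacman -Qkk` output on stdin. Assumes the English message format
# "<pkg>: <n> total files, <m> altered files".
altered_packages() {
    sed -nE 's/^([^:]+): [0-9]+ total files?, ([1-9][0-9]*) altered files?$/\1 \2/p'
}

# Typical use:
#   LC_ALL=C pacman -Qkk 2>/dev/null | altered_packages
# Reinstall every package that came from the official repos:
#   pacman -Qqn | sudo pacman -S -
```

Note that edited config files and changed permissions or timestamps also count as altered, so a non-empty list is not proof of corruption by itself.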

NoisyFlake · 6 months ago

Starting over probably won’t fix anything, since even the EndeavourOS live image has the segfaults. Of course I could just start over on Debian, but I really like Arch and would only switch as a last resort.

I have another computer where I can test it, yes. It’s probably enough to run EndeavourOS live for a while, but then again, I’m 99% sure that no crashes are going to happen, otherwise the EndeavourOS forums would be flooded with this issue.

Possibly linux@lemmy.zip · 6 months ago

If your live system is crashing it is definitely a hardware problem. Can you post dmesg?

NoisyFlake · 6 months ago

Here you go: https://pastebin.com/MZfhB0xK

Possibly linux@lemmy.zip · 6 months ago

There is one line that catches my attention.

ccp 0000:2b:00.1: ccp: unable to access the device: you might be running a broken BIOS

This theoretically shouldn’t cause crashes, but from my research it looks like AMD CCP can cause system instability in some cases. I would update your BIOS to the latest release, and if that doesn’t fix it, you should try disabling AMD CCP in the BIOS, as I doubt you need it anyway.

vildis@lemmy.dbzer0.com · 6 months ago

Could you try an older EndeavourOS image?

This sounds very much like a driver/firmware/hardware issue

CameronDev@programming.dev · 6 months ago

Try increasing RAM voltage? Might make it more stable under load. I had a similar issue, clean memtest, but games would randomly crash. Increasing RAM voltage fixed it.

NoisyFlake · 6 months ago

What voltage should I try? It’s currently at 1.35V, and I’ve read somewhere that this is the highest “safe” voltage.

CameronDev@programming.dev · edit-2 6 months ago

I jumped to 1.4V, which afaik is safe. But I can’t guarantee anything. Going up slowly might be better, but stop at 1.4?

Corsair says 1.4 is safe: https://help.corsair.com/hc/en-us/articles/360052448851-Tips-on-safely-overclocking-memory

Avid Amoeba@lemmy.ca · 6 months ago

Crashes on Arch, doesn’t crash on Debian:

Debian > Arch

Sanguine@lemmy.world · 6 months ago

Not the point of this thread.

Avid Amoeba@lemmy.ca · 6 months ago

Of course.