Hello Homelab! I’ve been working on upgrading my desktop/workstation (it currently runs, and will continue to run, Proxmox with a Win10 gaming VM with GPU passthrough and assorted other VMs).

Previously I was running an i9-7920X, 96GB RAM, dual 1080 Tis, and a 1TB NVMe drive (plus a lot more misc hardware). The new build is dual Epyc 7542s, 256GB RAM, an ASRock Rack ROME2D16-2T mobo, and dual 4TB U.2 NVMe drives, and I’m probably just going to transfer over the current GPUs until I can afford new ones (spent enough already for right now lol).

Here’s my issue: I’ve spent at least 2-3 weeks trying to get this system put together, and am STILL running into problems. I’ve NEVER had this many problems with a build, and I’m almost tearing my hair out trying to figure out what to do next. I apologize if this is scattered, but I’m going to try to go through what has happened so far.

Got the parts, put everything together (and for some reason didn’t think to start out with a minimal config) and naturally no post. Started removing stuff to get back to a minimal config, and at some point in the process saw something on the screen for a brief moment, but nothing further, no post. After a long time, many threads read, and everything tried that I could think of, diagnosed it as a bad mobo. Sent that back and got another.

New mobo arrives, get everything put together again for a minimal config (but both cpus, since I accidentally filled slot 2 first). Posts, and instantly tells me 3 of the 4 sticks of ram are bad. Take pictures, get an eBay return started (the ram was the cheapest I could find, about $50 per 64gb stick, I guess I got what I paid for). While waiting for this to get sorted out, order 4x more sticks of 64gb.

At this point the system posts reliably with a single stick in. Memtest86 comes up clean, all is well.

Get the 4x sticks of ram (second order, different seller), put them in, and no post. Debug code “AD”. Start swapping sticks into different slots, eventually get past the debug code and it posts, only to tell me 2x of the 4x new sticks are bad.

Now I start double-checking everything again, including torque specs, and find that both CPUs are torqued down looser than they’re supposed to be (I have a digital torque screwdriver). They were reading around 6 lb-in, so I upped them to the recommended 13.5 lb-in or so (making sure to use the proper 1-2-3 pattern). 2 of the 4 ram sticks still show bad, so I figure it really is just bad ram and I’m unlucky.

Using the 3x good sticks I have in total that allow the system to post, I run Memtest (192gb ram), and it runs for 24+ hours without a single error (ram good?). At this point I finally dust off my hands and go “yep, I’m finally out of this mess, just need to replace the bad ram and we’re good to go”. Or so I thought?

I boot up Proxmox after the ram test, all seems well. I restart, and no post, no debug code; the 7-segment debug LEDs are completely off. IPMI still works. Turn the system off and back on several times, still nothing on the debug display, no post. Flip the switch on the power supply off and back on, power it up, and I get debug code AD again, just like before. Mind you, there were no hardware changes between the 24+ hr memtest going fine and now. Turn off, back on again, same thing. I do this several times and eventually, with no further hardware changes, it posts fine, still sees all 192gb, and boots into Proxmox.

I don’t necessarily think this second Mobo is bad too. The 3 sticks of ram seem to be fine (if they were bad, I’m sure errors would appear in that 24hr stretch of memtest). At this point, are my CPUs bad? I saw memtest cycle through all cores and use all cores. Temps seem fine too. Torque specs are good. I’ve checked for extra standoffs in the case (read someone else had weird issues and it was caused by an extra standoff). None were found. Once the system posts/boots I haven’t noticed a single freeze, crash, or anything odd (although I obviously haven’t used it too much since I can’t trust it yet). But posting is super hit or miss, and I never know if I’m going to have to restart it once or 10 times to get it to post properly.

I’ve done literally EVERYTHING I can think of, and have no idea where to go from here. Anyone else go through anything like this? Any ideas what to do next?

  • merkuron@alien.topB
    10 months ago

    Is your RAM on the QVL? Ryzen’s notorious pickiness about RAM carries over to TR and EPYC, too. One of the first things before POST and BIOS splash display is memory training. If it can’t get past that, something about memory needs adjustment. Have you tried downclocking it?

    • jacksonhill0923@alien.topOPB
      10 months ago

      I wonder if that’s it. So when I checked the specs originally it listed the below RAM types as compatible.

      • RDIMM: up to 64GB
      • LRDIMM: up to 128GB
      • RDIMM/LRDIMM-3DS: up to 256GB
      • NVDIMM-N: up to 32GB

      But when I check the QVL on ASRock’s site, it only lists 10 different modules, and all of them are RDIMMs (mine are LRDIMMs). The other thing I noticed is that all the RAM listed is between 2666 and 3200 MHz, while mine is 2400.

      • My first batch of RAM - Micron 64gb LRDIMMs, MTA72ASS8G72LZ
      • Second batch of RAM - Samsung 64gb LRDIMMs, M386A8K40BM1-CRC5Q
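
The comparison above can be written down as a quick check. This is only a rough sketch: the `Dimm` class and `qvl_mismatches` function are made-up names for illustration, and the QVL criteria are just what is described above (RDIMMs only, 2666 to 3200), not a parse of ASRock's actual list. A module missing from the QVL is not necessarily incompatible; this only flags where the sticks differ from what is listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    """One memory module as described in the post (hypothetical helper type)."""
    label: str
    part_number: str
    kind: str       # "RDIMM" or "LRDIMM"
    speed: int      # rated speed as quoted in the post


# QVL summary as described above: only RDIMMs, rated 2666 to 3200.
QVL_KINDS = {"RDIMM"}
QVL_SPEED_RANGE = (2666, 3200)


def qvl_mismatches(dimm):
    """Return a list of ways this module differs from the QVL summary."""
    problems = []
    if dimm.kind not in QVL_KINDS:
        problems.append(f"type {dimm.kind} is not a listed type {sorted(QVL_KINDS)}")
    low, high = QVL_SPEED_RANGE
    if not low <= dimm.speed <= high:
        problems.append(f"speed {dimm.speed} is outside the listed range {low}-{high}")
    return problems


# The two batches from the post.
MY_DIMMS = [
    Dimm("first batch (Micron)", "MTA72ASS8G72LZ", "LRDIMM", 2400),
    Dimm("second batch (Samsung)", "M386A8K40BM1-CRC5Q", "LRDIMM", 2400),
]

if __name__ == "__main__":
    for dimm in MY_DIMMS:
        issues = qvl_mismatches(dimm)
        status = "; ".join(issues) if issues else "matches the QVL summary"
        print(f"{dimm.label} {dimm.part_number}: {status}")
```

Both batches differ from the listed modules on both counts (type and speed), which fits the suggestion that memory training could be where things go wrong.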

      First thing I noticed this morning: because I was using the 1 good stick from the first batch and 2 from the second, I was mixing RAM vendors, which I’ve always heard is bad. I’ve never had an issue with that in the past, but to rule it out I removed the Micron stick and currently just have the 2 working (memtest confirmed good) Samsung sticks installed. Tried to post, and it’s stuck on debug code AD. So while the RAM may still be what’s causing my issues, I don’t think mixing RAM specifically was what stopped it from posting.

      Another note: according to the manual, code AD means “DXE_READY_TO_BOOT”. 50, 53, and 55 are the RAM-related codes: invalid type, not detected, and not installed, respectively. I feel like every time it successfully posts I see those codes appear for a split second and disappear as it cycles through a lot of different numbers, but when it hangs on AD it hasn’t hit them yet. Almost makes me feel like whatever is hanging it up happens before it even checks for the RAM.
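
For keeping track of what the debug display shows across boot attempts, a tiny lookup like the one below might help. It is only a sketch: the table contains just the four codes quoted from the manual above, and `describe` and `ram_codes_seen` are hypothetical helper names, not anything from ASRock's tooling. Any other code is reported as unknown rather than guessed.

```python
# Codes and meanings exactly as quoted from the manual in the post above.
POST_CODES = {
    "AD": "DXE_READY_TO_BOOT",
    "50": "memory: invalid type",
    "53": "memory: not detected",
    "55": "memory: not installed",
}

# The codes the post identifies as RAM related.
RAM_CODES = {"50", "53", "55"}


def describe(code):
    """Return a one-line description for a two-character debug code."""
    code = code.strip().upper()
    meaning = POST_CODES.get(code, "unknown (check the manual)")
    tag = " [RAM related]" if code in RAM_CODES else ""
    return f"{code}: {meaning}{tag}"


def ram_codes_seen(codes):
    """Return the RAM-related codes in a noted sequence, in the order seen."""
    return [c.strip().upper() for c in codes if c.strip().upper() in RAM_CODES]


if __name__ == "__main__":
    # Print the description for each code mentioned in the post.
    for code in ("AD", "50", "53", "55"):
        print(describe(code))
```

Jotting down the codes from a good boot and a hung boot, then passing each list to `ram_codes_seen`, would make it easier to confirm whether the RAM codes really never show up before the hang on AD.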

      Last thing: it occurred to me that maybe it was just taking an extra long time to boot. Every time it does post, it goes through the process relatively fast. Just in case I wasn’t being patient enough, I’ve now let it sit on AD for about 10 minutes with no change, so it’s definitely stuck.