this post was submitted on 15 Nov 2023

Homelab

What should I monitor or log, and how, to determine why my headless NAS keeps becoming unavailable?

The problem:

  • Another machine that depends on the NAS routinely loses its services because its NFS mounts have dropped.
  • When that happens, a sudo mount -a sometimes brings them back.
  • Other times the NAS does not respond to ping, so I go to the physical host, plug in a monitor and keyboard, and find that I can't log in. The login screen is frozen and needs a hard reboot.
  • Often when I leave a monitor attached (VGA), I come back to a screen that says:
critical medium error, dev sda, sector 163776752 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
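
One way to turn that console message into something countable is to scan the kernel log for it. The sketch below is only an illustration: the regular expression is modelled on the single line quoted above, and the function names are made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the one console line quoted in the post (an assumption:
# other kernels or drivers may word the message differently).
MEDIUM_ERROR = re.compile(
    r"critical medium error, dev (?P<dev>\w+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)

def parse_medium_errors(lines):
    """Return (device, sector, operation) tuples for every matching log line."""
    found = []
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            found.append((match["dev"], int(match["sector"]), match["op"]))
    return found

def errors_per_device(lines):
    """Count matching lines per block device."""
    return Counter(dev for dev, _, _ in parse_medium_errors(lines))

# The line from the post, used as sample input.
SAMPLE = ("critical medium error, dev sda, sector 163776752 "
          "op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2")

print(parse_medium_errors([SAMPLE]))   # [('sda', 163776752, 'READ')]
```

The same functions could be fed the lines from journalctl -k -b -1 --no-pager to look at the boot that froze, but only if the journal is kept across reboots.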

I started a sudo smartctl -t long /dev/sda a few hours ago, and at some point since then the server that depends on the NAS lost its NFS mounts again. A simple sudo mount -a restored them.
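
Alongside the long self-test, the SMART attribute counters can be logged over time. This is a rough sketch that assumes sda is a SATA drive and that smartctl is new enough to support JSON output (-j). The attribute IDs watched here are a common choice, not something taken from this thread, and the helper names are hypothetical.

```python
import json
import subprocess

# Attribute IDs often associated with failing media (5, 197, 198) or a bad
# cable/link (199). Which IDs a drive reports varies by vendor.
WATCHED_IDS = {5, 197, 198, 199}

def nonzero_watched(report):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    hits = {}
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED_IDS and raw != 0:
            hits[attr.get("name", str(attr.get("id")))] = raw
    return hits

def read_report(device):
    """Run smartctl and parse its JSON output (needs root; not called in the demo)."""
    # smartctl uses its exit status as a bit mask, so a non-zero code is not
    # treated as a failure here.
    proc = subprocess.run(["smartctl", "-j", "-A", device],
                          capture_output=True, text=True)
    return json.loads(proc.stdout)

# Made-up report fragment for illustration only; these are NOT values from the
# poster's drive.
EXAMPLE_REPORT = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}},
            {"id": 194, "name": "Temperature_Celsius", "raw": {"value": 35}},
        ]
    }
}

print(nonzero_watched(EXAMPLE_REPORT))   # {'Current_Pending_Sector': 8}
```

Calling nonzero_watched(read_report("/dev/sda")) from a timer and appending the result to a file would show whether the counters climb between freezes.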

What the server was also doing when it had a network blip:

  • rclone was backing up to Backblaze B2
  • Acting as NFS server for Plex/*arr media server
  • Acting as NFS storage for Proxmox machine (but no VMs or CTs running)
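
To line up blips like this with what the NAS was doing at the time, the client machine could keep a timestamped record of whether the NAS answers on the NFS port and which NFS mounts are present. The sketch below makes several assumptions: the host name and mount paths are placeholders, port 2049 over TCP is assumed, and the function names are invented for this example.

```python
import socket
import time

def tcp_reachable(host, port=2049, timeout=3.0):
    """True if a TCP connection to the NFS port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def mounted_nfs_paths(proc_mounts_text):
    """Mount points of type nfs or nfs4 found in /proc/mounts-style text."""
    paths = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            paths.add(fields[1])
    return paths

def status_fields(reachable, expected_paths, proc_mounts_text):
    """Summarise one check: port state plus expected mounts that are absent."""
    missing = sorted(set(expected_paths) - mounted_nfs_paths(proc_mounts_text))
    return {"nas_tcp": "up" if reachable else "down", "missing": missing}

def watch(host, expected_paths, interval=60):
    """Print one timestamped line per interval (loops forever; not called here)."""
    while True:
        # Reading /proc/mounts does not touch the NFS server, so it should not
        # block the way a stat() on a hung mount can.
        with open("/proc/mounts") as fh:
            fields = status_fields(tcp_reachable(host), expected_paths, fh.read())
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), fields, flush=True)
        time.sleep(interval)
```

A call such as watch("nas.example", ["/mnt/media"]) with output redirected to a file gives timestamps to compare against the rclone schedule and the NAS kernel log. Note that a mount can still be listed in /proc/mounts while the server is unresponsive, which is why the port check is logged separately.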

Pasted some zpool output below. Details about the machine:

  • Repurposed old hardware, just built this Debian 12 NAS a couple months ago

  • Operates as backup destination for other machines

  • Operates as the media location for my Plex machine, the other server that mounts the NAS via NFS.

  • P6X58D-E LGA 1366 motherboard, Intel X5670 CPU, 18 GB (3x4GB, 3x2GB triple channel)

  • 8 hard drives connected to LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

  • 10GbE to a managed TP-Link switch through one port on a Mellanox ConnectX-3 MCX312A-XCBT EN

    ➜ sudo zpool list
    NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
    nvr   5.45T  3.35T  2.10T        -         -     2%    61%  1.00x    ONLINE  -
    tank  70.9T  34.4T  36.5T        -         -     0%    48%  1.00x    ONLINE  -

    ➜ sudo zpool status -v
      pool: nvr
     state: ONLINE
      scan: scrub repaired 0B in 08:49:40 with 0 errors on Sun Nov 12 09:13:41 2023
    config:

          NAME            STATE     READ WRITE CKSUM
          nvr             ONLINE       0     0     0
            mirror-0      ONLINE       0     0     0
              6T-75LN0J4  ONLINE       0     0     0
              6T-95A2PNV  ONLINE       0     0     0

    errors: No known data errors

      pool: tank
     state: ONLINE
      scan: scrub repaired 1M in 16:44:16 with 0 errors on Sun Nov 12 17:08:27 2023
    config:

          NAME              STATE     READ WRITE CKSUM
          tank              ONLINE       0     0     0
            raidz1-0        ONLINE       0     0     0
              12T-5PGJ4A0D  ONLINE       0     0     0
              12T-Z2J26EBT  ONLINE       0     0     0
              12T-5PGHSZJC  ONLINE       0     0     0
            raidz1-1        ONLINE       0     0     0
              14T-9KG38U5L  ONLINE       0     0     0
              14T-9KG81HRL  ONLINE       0     0     0
              14T-9RGG5ZDC  ONLINE       0     0     0

    errors: No known data errors
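
Output like this can also be checked by a script instead of by eye. The following is a minimal sketch, assuming the plain-text layout of zpool status shown above. It flags any row whose state is not ONLINE or whose READ/WRITE/CKSUM counters are non-zero, and it records how much each scrub repaired. The function names are hypothetical and the sample text is an excerpt of the output above.

```python
import re

KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def suspect_rows(status_text):
    """Rows of the config table that are not ONLINE or have non-zero counters."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[1] not in KNOWN_STATES:
            continue  # not a device row
        name, state, read, write, cksum = fields[:5]
        # Counters are compared as strings because zpool may abbreviate them.
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            rows.append((name, state, read, write, cksum))
    return rows

def scrub_repairs(status_text):
    """Map each pool name to the amount its last scrub reported as repaired."""
    repairs = {}
    pool = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
        match = re.search(r"scrub repaired (\S+) in", stripped)
        if match and pool is not None:
            repairs[pool] = match.group(1)
    return repairs

# Excerpt of the output posted above.
SAMPLE = """\
  pool: nvr
 state: ONLINE
  scan: scrub repaired 0B in 08:49:40 with 0 errors on Sun Nov 12 09:13:41 2023
config:

        NAME            STATE     READ WRITE CKSUM
        nvr             ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 1M in 16:44:16 with 0 errors on Sun Nov 12 17:08:27 2023
"""

print(suspect_rows(SAMPLE))    # []
print(scrub_repairs(SAMPLE))   # {'nvr': '0B', 'tank': '1M'}
```

In this output every counter is zero, so the only signal is the 1M that the tank scrub repaired. Logging that figure after each scrub shows whether it is a one-off or keeps recurring.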

merkuron@alien.top, 1 year ago:

I’ve had drive failures bring down entire systems. Replace sda and see if the problems continue.

Fair enough! Going to start with memtest, per another comment, and narrow things down one at a time - probably by removing sda next.