this post was submitted on 23 Nov 2023

Heads up: data corruption bug in ZFS. A few versions are affected; it might have started at 2.1.x, but many reports are on 2.2.x (github.com)

submitted 2 years ago by vitzli-mmc@alien.top to c/datahoarder@selfhosted.forum

[–] dr100@alien.top 1 points 2 years ago (4 children)

This is why you ALWAYS need INDEPENDENT backups. You can think all day long about detecting bitrot, and how well you're protected against X drive failures but then something comes from the side and messes up your data in a different way than you've foreseen.

[–] katbyte@alien.top 1 points 2 years ago

Also an independent way to verify files. I cfv everything before a big move and then again afterwards to check.
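For anyone without cfv, the same before/after check can be sketched in Python. This is only an illustration of the idea, not what cfv does internally; the function names `sha256_of`, `build_manifest` and `diff_manifests` are made up for this example, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(before, after):
    """Return (missing, changed, added) lists of relative paths."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        name for name in before.keys() & after.keys()
        if before[name] != after[name]
    )
    return missing, changed, added
```

Build one manifest on the source before the move and one on the destination after it; anything in `missing` or `changed` deserves a second look.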

[–] henry_tennenbaum@alien.top 1 points 2 years ago

Wait. Are you trying to say that raid is not a backup?

[–] quint21@alien.top 1 points 2 years ago

> something comes from the side and messes up your data in a different way than you've foreseen.

This happened to me years ago. I naïvely thought SnapRAID protected me against a drive failure. I wasn't prepared for two drives failing simultaneously because a power supply failed catastrophically (smoke, sparks) and fried the drives as it died.

It was an expensive lesson: I had to send one drive off for data recovery, and after I got it back I used SnapRAID to restore the remaining drive. Independent backups (and multiple parity drives) are the way.

[–] imakesawdust@alien.top 1 points 2 years ago (1 children)

The problem here is that those independent backups would also be corrupted. As I understand from the GitHub discussion, the issue might be a bug that causes ZFS to not recognize when a page is dirty and needs to be flushed, and it is somehow triggered when copying files using a new-ish optimization that has been implemented in Linux and *BSD kernels. If you trigger the bug while copying a file, the original remains kosher but the new file has swaths of bad data. Any backup made after this point would contain both the (good) original and the (corrupted) copy.
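As long as the original is still around, a copy can be checked against it directly. A minimal sketch, assuming both files are readable as ordinary files; `differing_ranges` is a hypothetical helper name and the 4096-byte chunk size is arbitrary:

```python
def differing_ranges(original, copy, chunk_size=4096):
    """Return a list of (offset, length) chunks where the two files differ."""
    ranges = []
    offset = 0
    with open(original, "rb") as a, open(copy, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break  # both files exhausted
            if chunk_a != chunk_b:
                # Also covers the case where one file is shorter than the other.
                ranges.append((offset, max(len(chunk_a), len(chunk_b))))
            offset += chunk_size
    return ranges
```

An empty list means the copy matches byte for byte; otherwise the offsets show where the bad swaths are.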

[–] dr100@alien.top 1 points 2 years ago

The point is you'll still have the originals, which you might have removed in the meantime. For example, someone reorganizing a huge collection might work on the reflinked copy and remove the original at the end, which is a natural cleanup workflow. Not many would think to check the results after a nearly instant reflinked copy, or foresee that if there's any bitrot it would come from THAT.

Sure, in this case snapshots would have worked just as well, but of course there are other cases in which they wouldn't have. Independent backups cover everything, assuming you keep enough history, which is another discussion. (After removing an important old file by mistake I considered literally keeping history forever, but that becomes too daunting, and it's too tempting to purge files deleted 1, 2, or 3 years ago.)

[–] EchoGecko795@alien.top 1 points 2 years ago (1 children)

modinfo zfs | grep version

To quickly get the version installed.

[–] 3-2-1-backup@alien.top 1 points 2 years ago (1 children)

zfs --version also does the trick.

[–] tatiwtr@alien.top 1 points 2 years ago (1 children)

That did not work for me on Ubuntu, but it did on my Debian/Proxmox distribution.

proxmox:

zfs-0.8.3-pve1

zfs-kmod-0.8.3-pve1

ubuntu:

version: 0.6.5.6-0ubuntu26

srcversion: 0968F94158D646E259D86B5

vermagic: 4.4.0-142-generic SMP mod_unload modversions retpoline

Looks like I'm using an ancient version and am OK?

[–] gabest@alien.top 2 points 2 years ago

I also use a version close to that, 0.something. I see absolutely no reason to upgrade. It just works. It's the version that already has the fast scrub.

[–] 3-2-1-backup@alien.top 1 points 2 years ago

Makes me really glad I almost never bother to upgrade my pool flags!

(I mean seriously, I can't think of the last time I had a use for new flags!)

[–] n3rt46@alien.top 1 points 2 years ago (1 children)

Is this not for OpenZFS in particular? To my knowledge OpenZFS and ZFS are separate.

[–] flaser_@alien.top 1 points 2 years ago

The bug was reported on OpenZFS, check the link from OP's post:
https://github.com/openzfs/zfs/issues/15526

[–] zhiryst@alien.top 1 points 2 years ago (3 children)

Can someone tell a dummy like me whether this impacts TrueNAS Core?

[–] iamcts@alien.top 1 points 2 years ago

It's hard to tell because I get this:

root@truenas[~]# zpool get version poolname
NAME      PROPERTY  VALUE  SOURCE
poolname  version   -      default

[–] realitycorp@alien.top 1 points 2 years ago

There is a thread going on at https://www.truenas.com/community/threads/silent-corruption-with-openzfs-ongoing-discussion-and-testing.114390/

Some users are reporting that it does affect TrueNAS (though it depends on the use case).

[–] EquivalentRisk3069@alien.top 1 points 2 years ago

TrueNAS-13.0-U6 (the current core version)

zfs version reports 2.1.13

so it should be clean.

[–] SlyFox125@alien.top 1 points 2 years ago (2 children)

Anyone have any ideas for checking for this issue in existing backups?

[–] Is-Not-El@alien.top 1 points 2 years ago (1 children)

Not confirmed but promising - https://github.com/openzfs/zfs/issues/15526#issuecomment-1810800004 and https://github.com/openzfs/zfs/issues/15526#issuecomment-1810819382

[–] SlyFox125@alien.top 1 points 2 years ago

Thank you. Upon reading further, the state of block cloning seems to be the major variable in whether any corruption has occurred. However, there appears to remain a non-zero chance that such corruption could occur regardless of block cloning, and that this dates back to 2.1.4/2.1.5, which were released in March/June of 2022.
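To compare the version strings posted in this thread against that 2.1.4 figure, something like the following could work. It is a rough sketch: the 2.1.4 floor is taken at face value from this comment and is not an authoritative list of affected releases, and `parse_zfs_version` / `at_or_above_floor` are made-up names.

```python
import re

# Assumption: 2.1.4 is the earliest release named in this thread,
# not a lower bound confirmed by the OpenZFS project.
THREAD_FLOOR = (2, 1, 4)


def parse_zfs_version(text):
    """Extract the first dotted version number as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def at_or_above_floor(text, floor=THREAD_FLOOR):
    """True if the version in `text` is at or above the assumed floor."""
    return parse_zfs_version(text) >= floor
```

Pass it only the version line itself (for example `zfs-0.8.3-pve1` or `version: 0.6.5.6-0ubuntu26`), since it picks up the first dotted number it sees.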

[–] vitzli-mmc@alien.top 1 points 2 years ago (1 children)

The script at #15526 can somewhat check for a hole in the first 4K bytes of a file, but it gives false positives. If the script produces a syntax error on the last line, replace /bin/sh with /bin/bash or wherever bash is located.

I used it on part of my collection and found several zeroed-out files, but I strongly suspect they were full of zeroes before they hit ZFS; at least some files from 2009 were full of zeroes. The script gave multiple false positives (and one true positive on a fully zeroed file) on .iso files; I suspect those are missing a boot record.
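A similar heuristic can be sketched in Python. To be clear, this is not the script from #15526: it only reads the bytes and checks whether they are zero, it does not look at how holes are laid out on disk, and the function names are invented for this example. A hit is a hint to look closer, not proof of corruption.

```python
def leading_bytes_are_zero(path, length=4096):
    """True if the file is non-empty and its first `length` bytes are all zero."""
    with open(path, "rb") as handle:
        head = handle.read(length)
    return len(head) > 0 and head.count(0) == len(head)


def file_is_all_zero(path, chunk_size=1024 * 1024):
    """True if the file is non-empty and every byte in it is zero."""
    seen_data = False
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            seen_data = True
            if chunk.count(0) != len(chunk):
                return False
    return seen_data
```

Files that legitimately start with zeroes will trip the first check, which matches the false positives described above.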

[–] SlyFox125@alien.top 1 points 2 years ago (1 children)

Thank you. I've been keeping an eye on the thread to see whether any consensus emerges on how the corruption manifests itself. It appears a portion could be zeroed out and then have new data written over it, giving the impression that all is well while the file is still corrupt. The best method seems to be a list of checksums from known good files, but that requires having made one in advance, and most people never anticipated this and have no such list.

[–] vitzli-mmc@alien.top 1 points 2 years ago (1 children)

I was able to copy a zipped 400GB dump from a torrent and checksum it before and after the move: no failures so far, at least at the beginning.

[–] SlyFox125@alien.top 1 points 2 years ago

It appears the issue arises more when a ZFS file system is used as primary storage, i.e., read from and written to directly as part of some active operation. Are you using it as a backup/archive, or as a primary partition that your OS and applications write to directly? If it's the former, you'd seem much less likely to encounter the issue.

[–] EVnegative@alien.top 1 points 2 years ago

It's too bad file system code isn't easy to verify. It would be great if there was a file system that was formally verified (https://en.wikipedia.org/wiki/Formal_verification).

[–] Is-Not-El@alien.top 1 points 2 years ago (1 children)

Ah heck I just updated my NAS VM to FreeBSD 14.

Anyone running FreeBSD 14, make sure vfs.zfs.bclone_enabled is set to 0.
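If you want to check that setting from a script, here is a minimal sketch that shells out to `sysctl -n`. It assumes a FreeBSD host where the `vfs.zfs.bclone_enabled` sysctl named above exists; `bclone_enabled` is a made-up helper name, and the `run` parameter is only there so the command can be stubbed out.

```python
import subprocess


def bclone_enabled(run=subprocess.run):
    """Return True if `sysctl -n vfs.zfs.bclone_enabled` prints a non-zero value."""
    result = run(
        ["sysctl", "-n", "vfs.zfs.bclone_enabled"],
        capture_output=True,
        text=True,
        check=True,  # raise if the sysctl is missing or the command fails
    )
    return int(result.stdout.strip()) != 0
```

This only reads the value; changing it is still done with sysctl itself.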

[–] grahamperrin@alien.top 1 points 2 years ago

> … FreeBSD 14, …

Not only 14 …

[–] bobj33@alien.top 1 points 2 years ago

It looks like the source of the bug is identified and fixed.

https://github.com/openzfs/zfs/pull/15579/commits/679738cc408d575289af2e31cdb1db9e311f0adf

[2.2] dnode_is_dirty: check dnode and its data for dirtiness #15579