this post was submitted on 15 Jun 2023
Technology
It’s important here to think about a few large issues with this data.
First, data storage. Other people in here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can be easily migrated from one storage tech to the next. There’s a lot of work here, not just in getting all the data, but in making sure it continues to move forward as we develop new technologies and new storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die, bit rot happens. Even powered off, a spinning drive will fail eventually, as will an SSD. CDs I wrote 15+ years ago aren’t 100% readable.
Second, there’s data organization. How can you find what you want later when all you have are images of systems, backups of databases, and static flat files of websites? A lot of sites now require JavaScript and other browser operations to be viewed or used at all. You’ll just have a flat file with a bunch of rendered HTML; can you really still find the one you want? Search boxes won’t work, and API calls will fail without the real site up and running. Databases have to be restored to be queried, and if they’re relational, who will know how to connect those dots?
Third, formats. Sort of like the previous point, but what happens when JPG is deprecated in favor of something better? Can you currently open up that file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those programs up as well… along with the OSes that they run on. And if there are no processors left that they can run on, we’ll need emulators. Obviously standards are great here; we may not forget how to read a PCX or GIF or JPG file for a while, but more niche things will definitely fall by the wayside.
Fourth, timescale. Can we keep this stuff for 50 yrs? 100 yrs? 1000 yrs? What happens when our great*30-grandchildren want to find this info? We regularly find things from a few thousand years ago here on earth at archaeological dig sites and such. There’s a difference between backing something up for use in a few months and for use in a few years. What about a few hundred or a few thousand? Data storage will be vastly different, as will processors and displays and such. … Or what happens in a Horizon Zero Dawn scenario, where all the secrets are locked up in a vault of technology left to rot that no one knows how to use because we’ve nuked ourselves into regression?
I guess I can talk a bit about the first and third points for my personal archiving (certainly not on a global scale).
For data storage, data should be regularly checked for bitrot and corruption, preferably with a file system that can heal itself if such a situation occurs. Personally I use ZFS RAIDZ with regular scrubs to make sure that my data is bit-perfect. Disks that regularly show issues are trashed, even if they appear to run fine and show good SMART status. For optical disks kept in a safe or something, I reburn them every ten years or so, even if they're still readable, to keep the medium fresh.
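For anyone not on a self-healing file system, the "regularly check for corruption" part can be roughly approximated by hand with a checksum manifest. This is only a minimal sketch, not how ZFS scrubs work: the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are made up for illustration, and it can only detect a changed or missing file, not repair it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return (relative_path, problem) pairs for files that are missing or changed."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file():
            problems.append((rel, "missing"))
        elif sha256_of(p) != expected:
            problems.append((rel, "changed"))
    return problems
```

You would build the manifest once when the archive is written, store it alongside (and separately from) the data, and run the verify step on a schedule. Anything it flags has to be restored from another copy, which is the part a redundant ZFS pool does for you automatically.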
I've actually known someone who had to painfully set up a Windows 95 computer in order to convert some old digital pictures from an equally old digital camera that were stored in a proprietary format. Obviously that's a no-go. For my archives I try to use standard open formats like PNG, PDF, etc. that won't go away for a long time and can be reconverted as part of an archive update if the format starts to become obsolete. You can't just digitally archive everything and expect it to be easily readable after a hundred years. I don't do this, but if space is not a constraint, lossless formats could be used (PNG for photos, FLAC for audio, etc.) so any later conversions remain true to the original capture.
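To know when an "archive update" is due, it helps to know what is actually sitting in the archive. Here is a rough sketch that inventories files by their leading magic bytes rather than trusting extensions. The signature table only covers a handful of well-known formats, the names `sniff_format` and `inventory` are hypothetical, and anything it doesn't recognize simply comes back as "unknown" and needs a closer look by hand.

```python
from collections import Counter
from pathlib import Path

# Leading byte signatures for a few common open formats.
# Note: DNG is TIFF-based, so DNG files would be reported as "tiff" here.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"fLaC": "flac",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}


def sniff_format(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


def inventory(root):
    """Count files under root by sniffed format."""
    counts = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            with open(p, "rb") as f:
                counts[sniff_format(f.read(16))] += 1
    return counts
```

A growing "unknown" count is the early warning that something like that old camera format has crept into the archive.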
Actually I think TIFF or Adobe DNG are the lossless formats for photos.
TIFF is a classic storage format, but PNG is common for web images and isn't going away either. DNG is for RAW sensor output from professional cameras and is not used for edited and published images. However, if you're archiving your photo collection or something, then keep the DNGs!
There is an experimental storage format that can store large amounts of data in a fused quartz disc. The data will not degrade with time since the bits are physically burned into the quartz.