this post was submitted on 15 Jun 2023
160 points (100.0% liked)

Technology


This is something that keeps me worried at night. Unlike other historical artefacts like pottery, vellum writing, or stone tablets, information on the Internet can just blink into nonexistence when the server hosting it goes offline. This makes it difficult for future anthropologists who want to study our history and document the different Internet epochs. For my part, I always try to send any news article I see to an archival site (like archive.ph) to help collectively preserve our present so it can still be seen by others in the future.
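For anyone who wants to script that habit: the Wayback Machine exposes a "Save Page Now" endpoint that accepts a plain GET to web.archive.org/save/ followed by the target URL. A minimal sketch (the helper name is mine; rate limits and capture behavior vary):

```python
from urllib.parse import quote

def save_page_now_url(target: str) -> str:
    """Build the Wayback Machine 'Save Page Now' URL for a page.

    A GET request to the returned URL asks archive.org to capture the page.
    """
    # Keep ':' and '/' unescaped so the target URL stays readable.
    return "https://web.archive.org/save/" + quote(target, safe=":/")

# e.g. requests.get(save_page_now_url("https://example.com/article"))
# would trigger a capture (subject to archive.org's rate limits).
```

Wiring this into a bookmarklet or a share-sheet shortcut makes archiving a one-tap reflex instead of a chore.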

[–] strainedl0ve@beehaw.org 20 points 1 year ago (2 children)

This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

The whole internet is extremely ephemeral, more than people realize, and it's concerning in my opinion. Funny enough, I actually think that federation/decentralization might be the solution. A distributed system to back up the internet that anyone can contribute storage and bandwidth to might be the only sustainable solution. I wonder if anyone has thought about it already.

[–] entropicdrift@lemmy.sdf.org 1 points 1 year ago

I'd argue that decentralization can either help or hurt, depending on how it's handled. If most sites cache or back up data found elsewhere, that's good for both resilience and preservation; but if the data in question stays centralized on its home server, then instead of backing up one site we're stuck backing up a thousand, not to mention the potential issues with discovery.

[–] RealAccountNameHere@beehaw.org 15 points 1 year ago (1 children)

I worry about this too. I've always said and thought that I feel more like a citizen of the Internet than of my country, state, or town, so its history is important to me.

[–] Gork@beehaw.org 4 points 1 year ago (2 children)

Yeah and unless someone has the exact knowledge of what hard drive to look for in a server rack somewhere, tracing an individual site's contents that went 404 is practically impossible.

I wonder though if Cloud applications would be more robust than individual websites since they tend to be managed by larger organizations (AWS, Azure, etc).

Maybe we need a Svalbard Seed Vault extension just to house gigantic redundant RAID arrays. 😄

[–] RealAccountNameHere@beehaw.org 4 points 1 year ago (1 children)

This isn't directly related to your comment, but you seem so smart, and I've got to say that's definitely one thing I'm enjoying on this website over Reddit! :-)

[–] Gork@beehaw.org 2 points 1 year ago

Thanks ^_^ I don't consider myself brilliant or anything but I appreciate your compliment! The thing I like the most is that everyone is so friendly around here, yourself included ☺️

[–] jmp242@sopuli.xyz 3 points 1 year ago

We're actually well beyond RAID arrays. Google "Ceph". It's both super complicated and kind of simple to grow to really large storage amounts with LOTS of redundancy. It's trickier for global-scale redundancy; I think you'd need multiple clusters with something else syncing them.

I also always come back to some of the stuff freenet used to do in older versions where everyone who was a client also contributed disk space that was opaque to them, but kept a copy of what you went and looked at, and what you relayed via it for others. The more people looking at content, the more copies you ended up with in the system, and it would only lose data if no one was interested in it for some period of time.

[–] xray@beehaw.org 14 points 1 year ago* (last edited 1 year ago) (3 children)

Yeah it’s funny how I always got warned that “the internet is forever” when it comes to being careful about what you post on social media, which isn’t bad advice and is kinda true, but also really kinda not true. So many things I’ve wanted to find on the internet that I experienced like 15 years ago are just gone without a trace.

[–] buckykat@lemmy.fmhy.ml 7 points 1 year ago

Things you want to disappear will last forever but things you want to keep will vanish

[–] squaresinger@feddit.de 6 points 1 year ago* (last edited 1 year ago)

The internet can be forever. If you mess up publicly enough, it will be forever (e.g. the aerial picture of Barbra Streisand's villa)

[–] parrot-party@kbin.social 3 points 1 year ago

It should be revised to "the Internet can be forever". There's no control over what persists and what doesn't, but some things really do get copied everywhere and live on in infamy.

[–] fenfalca@lemmy.one 13 points 1 year ago

Remember a few years ago when MySpace did a faceplant during a server migration, and lost literally every single piece of music that had ever been uploaded? It was one of the single-largest losses of Internet history and it's just... not talked about at all anymore.

[–] HobbitFoot@thelemmy.club 7 points 1 year ago

Isn't that like a lot of older television shows? Lots of shows are lost as no one wanted to pay for tape storage.

[–] thejml@lemm.ee 7 points 1 year ago (2 children)

It’s important here to think about a few large issues with this data.

First, data storage. Other people in here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can be easily migrated from one storage tech to the next. There’s a lot of work here, not just in getting all the data, but in making sure it continues to move forward as we develop new technologies and new storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die, bit rot happens. Even powered off, a spinning drive will fail, as will an SSD with time. CDs I burned 15+ years ago aren’t 100% readable anymore.
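One low-tech defense against silent bit rot is a checksum manifest that you re-verify before and after every migration. A minimal sketch in Python (function names are mine, not any standard tool):

```python
import hashlib
import pathlib

def manifest(root: pathlib.Path) -> dict:
    """Map each file under root (relative path) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify(root: pathlib.Path, old: dict) -> list:
    """Return the files whose current digest no longer matches the manifest."""
    now = manifest(root)
    return [name for name, digest in old.items() if now.get(name) != digest]
```

Store the manifest alongside the archive (and a copy elsewhere), and any flipped bit shows up as a mismatch on the next verify pass.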

Second, there’s data organization. How can you find what you want later when all you have are images of systems, backups of databases, and static flat files of websites? A lot of sites now require JavaScript and other browser operations to view/use the site. You’ll just have a flat file with a bunch of rendered HTML; can you really still find the one you want? Search boxes won’t work, and API calls will fail without the real site up and running. Databases have to be restored to be queried, and if they’re relational, who will know how to connect those dots?
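A flat-file archive can still be made searchable by indexing it offline after the fact. A toy inverted index over rendered HTML, as a sketch (the naive tag-stripping regex is illustrative only, not a real HTML parser):

```python
import re
from collections import defaultdict

def build_index(pages: dict) -> dict:
    """pages: {url: html}. Returns a word -> set-of-urls inverted index.

    Tags are stripped with a crude regex; real archives would use a
    proper HTML parser and tokenizer.
    """
    index = defaultdict(set)
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            index[word].add(url)
    return index

pages = {"a.html": "<p>internet archive</p>", "b.html": "<p>lost media</p>"}
idx = build_index(pages)
# idx["archive"] == {"a.html"}
```

This is roughly what full-text search over a crawl dump looks like at its core; tools built on the same idea can make a dead site's snapshot searchable again without its backend.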

Third, formats. Sort of like the previous point, but what happens when JPG is deprecated in favor of something better? Can you currently open that file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those up as well… along with the OSes they run on. And if there are no processors left that can run them, we’ll need emulators. Obviously standards are great here; we may not forget how to read a PCX or GIF or JPG file for a while, but more niche things will definitely fall by the wayside.

Fourth, timescale. Can we keep this stuff for 50 yrs? 100 yrs? 1000 yrs? What happens when our great*30-grand-children want to find this info? We regularly find things from a few thousand years ago here on earth with archeological digsites and such. There’s a difference between backing something up for use in a few months and for use in a few years; what about a few hundred or thousand? Data storage will be vastly different, as will processors and displays and such. … Or what happens in a Horizon Zero Dawn scenario where all the secrets are locked up in a vault of technology left to rot that no one knows how to use, because we’ve nuked ourselves into regression?

[–] digitallyfree@kbin.social 1 points 1 year ago* (last edited 1 year ago) (2 children)

I guess I can talk a bit about the first and third points for my personal archiving (certainly not on a global scale).

  • For data storage, data should regularly be checked for bitrot and corruption, preferably with a file system that can heal itself if such a situation occurs. Personally I use ZFS RAIDZ with regular scrubs to ensure that my data stays bit-perfect. Disks that regularly show issues are trashed, even if they appear to run fine and show good SMART status. For optical discs in a safe or something, I reburn them every ten years or so, even if they're still readable, to keep the medium fresh.

  • I've actually known someone who had to painfully set up a Windows 95 computer in order to convert some old digital pictures from an equally old digital camera, stored in a proprietary format. Obviously that's a no-go. For my archives I try to use standard open formats like PNG, PDF, etc. that won't go away for a long time and can be reconverted as part of an archive update if the format starts to become obsolete. You can't just digitally archive everything and expect it to be easily readable after a hundred years. I don't do this, but if space is limitless, lossless formats could be used (PNG for photos, FLAC for audio, etc.) so any conversions remain true to the original capture.
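That "reconverted as part of an archive update" step can be as simple as keeping a migration table of at-risk formats and sweeping the archive against it periodically. A sketch (the extension mapping is purely illustrative, not a recommendation):

```python
# Illustrative map of risky or proprietary extensions to an open target
# format; a real table would be curated per archive.
MIGRATIONS = {".bmp": ".png", ".pcx": ".png", ".doc": ".pdf", ".wav": ".flac"}

def plan_migration(filenames):
    """Return (old_name, new_name) pairs for files stored in at-risk formats.

    This only plans the renames; actual conversion would be done by a
    format-specific tool per pair.
    """
    plans = []
    for name in filenames:
        stem, _dot, ext = name.rpartition(".")
        target = MIGRATIONS.get("." + ext.lower())
        if target:
            plans.append((name, stem + target))
    return plans
```

Running the planner on every archive pass means formats get flagged for conversion while tools to read them still exist, rather than after they're gone.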

[–] cmnybo@discuss.tchncs.de 1 points 1 year ago

There is an experimental storage format that can store large amounts of data in a fused quartz disc. The data will not degrade with time since the bits are physically burned into the quartz.

[–] archon@dataterm.digital 6 points 1 year ago* (last edited 1 year ago)

Long ago the saying was "be careful - anything you post on the internet is forever". Well, time has certainly proven that to be false.

There are things like /r/DataHoarder (not sure if they have a new community here) that run their own petabyte-scale archiving projects; some people are doing their part.

[–] altz3r0@beehaw.org 6 points 1 year ago* (last edited 1 year ago) (1 children)

I think preservation is happening, the issue lies in accessibility. Projects like Archive.org are the public ones, but it is certain that private organizations are doing the same, just not making it public.

This is also my biggest worry about the Fediverse. It has tools to deal with it, but they are self-contained. No search engine is crawling the Fediverse as far as I've looked, and no initiative to archive, index, and overall make the content of the Fediverse accessible is currently in place, and that's a big risk. I'm sure we will soon be seeing loss of information for this reason, if it hasn't happened already.

[–] Dee_Imaginarium@beehaw.org 2 points 1 year ago (1 children)

It's still fairly new, I'm confident we'll see fediverse crawlers before too long. Especially with all the attention it's getting and more developers turning their interests here. I also saw some talk about instance mirroring that would allow backups should an instance go down. Things are in motion.

Absolutely a problem at the moment but I'm not too worried for the future tbh.

[–] altz3r0@beehaw.org 1 points 1 year ago (1 children)

Oh yeah, my hopes are high, I already am quite fond of this new home. :)

[–] Dee_Imaginarium@beehaw.org 2 points 1 year ago* (last edited 1 year ago)

Same! Howdy instance neighbor! 😄

[–] Otome-chan@kbin.social 5 points 1 year ago (2 children)

This is why stuff like the internet archive exist: to try and preserve this content. The problem is that governments are trying to shut down the internet archive...

[–] CherryClan@beehaw.org 4 points 1 year ago (3 children)

During the twitter exodus, my friend was fretting over not being able to access a beloved twitter account's tweets and wanting to save them somehow. I told her that if she printed them all on acid-free paper she'd have a better chance of being able to access them in the future than trying to save them digitally.

[–] chris@lemmy.sdf.org 2 points 1 year ago

sad and true

[–] DeGandalf@kbin.social 4 points 1 year ago

In this respect, the internet is closer to spoken language than to any written medium. Even if you use a service to archive the things you find, it's still possible that it shuts down too.

[–] lloram239@feddit.de 4 points 1 year ago* (last edited 1 year ago) (2 children)

Ultimately this is a problem that’s never going away until we replace URLs. The HTTP approach of finding documents by URL, i.e. server/path, is fundamentally brittle. Doesn’t matter how careful you are, doesn’t matter how much best practice you follow, that URL is going to be dead in a few years. The problem is made worse by DNS, which makes URLs expensive and lets them expire.

There are approaches like IPFS, which uses content-based addressing (i.e. fancy file hashes), but that’s not enough either, as it provides no good way to update a resource.

The best™ solution would be some kind of global blockchain thing that keeps record of what people publish, giving each document a unique id, hash, and some way to update that resource in a non-destructive way (i.e. the version history is preserved). Hosting itself would still need to be done by other parties, but a global log file that lists out all the stuff humans have published would make it much easier and reliable to mirror it.

The end result should be “Internet as globally distributed immutable data structure”.
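To make the idea concrete, here is a toy content-addressed store where every revision's id is the hash of its content plus a pointer to its parent revision, so updates are non-destructive and the full version history stays walkable. A sketch of the data structure only, not a real protocol:

```python
import hashlib
import json

class ContentStore:
    """Toy content-addressed store: ids are content hashes, and each
    revision links back to its parent, preserving history."""

    def __init__(self):
        self.blobs = {}  # content id -> serialized record

    def put(self, data, parent=None):
        """Store a revision; returns its globally unique content id."""
        record = json.dumps({"data": data.decode(), "parent": parent})
        cid = hashlib.sha256(record.encode()).hexdigest()
        self.blobs[cid] = record
        return cid

    def history(self, cid):
        """Walk the version chain from cid back to the first revision."""
        chain = []
        while cid is not None:
            record = json.loads(self.blobs[cid])
            chain.append(record["data"])
            cid = record["parent"]
        return chain
```

Because the id is derived from the content, any mirror serving the same bytes serves the same id, which is exactly the property that makes links location-independent.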

Bit frustrating that this whole problem isn’t getting the attention it deserves. And that even relatively new projects like the Fediverse aren't putting in the extra effort to at least address it locally.

[–] lucien@beehaw.org 2 points 1 year ago* (last edited 1 year ago) (2 children)

I don't think this will ever happen. The web is more than a network of changing documents. It's a network of portals into systems which change state based on who is looking at them and what they do.

In order for something like this to work, you'd need to determine what the "official" view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren't just generated on the spot, they're Turing complete documents which change themselves over time.

It's a bit of a quantum problem - you can't perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won't be able to change the underlying nature of what it is storing.

If all of reddit were deleted, it would definitely be useful to have a publically archived snapshot of Reddit. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can't preserve all of the value by simply making a snapshot of the static content available.

All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn't matter to the creator. It's a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about said information, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You'd never get enough people to agree to it that it would make a difference.

[–] LewsTherinTelescope@beehaw.org 3 points 1 year ago* (last edited 1 year ago)

Inability to edit or delete anything also fundamentally has a lot of problems on its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. Child shares too much online and a parent wants to undo that? No can do, it's there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else who sees it. Your ex posts revenge porn? Just gotta live with it for the rest of time.

There's always a risk of that when posting anything online, but that doesn't mean systems should be designed to lean into that by default.

[–] lloram239@feddit.de 2 points 1 year ago (1 children)

but the reality is that most documents are generated on the spot from many sources of data.

That's only true due to the way the current Web (d)evolved into a bunch of apps rendered in HTML. But there is fundamentally no reason why it should be that way. The actual data that drives the Web is mostly completely static. The videos Youtube has on their servers don't change. Posts on Reddit very rarely change. Twitter posts don't change either. The dynamic parts of the Web are the UI and the ads; they might change on each and every access, or be different for different users, but they aren't the parts you want to link to anyway. You want to link to a specific user's comment, not a specific user's comment rendered in a specific version of the Reddit UI with whatever ads were on display that day.

Usenet did that (almost) correctly 40 years ago: each message got a message-id, and each message replying to it would contain that id in a header. This is why large chunks of Usenet could be restored from tape archives and put back together. The way content linked to each other didn't depend on a storage location. It wasn't perfect of course; it had no cryptography going on and depended completely on users behaving nicely.
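The nice property of message-ids is that reply trees can be rebuilt from any pile of messages, recovered in any order from any mix of archives, with no storage locations involved. A sketch:

```python
from collections import defaultdict

def thread(messages):
    """Rebuild reply trees from location-independent message ids.

    messages: iterable of (message_id, parent_id_or_None) pairs, in any
    order -- only the ids matter, not where each message was stored.
    Returns (root_ids, parent_id -> list-of-child-ids).
    """
    children = defaultdict(list)
    roots = []
    for mid, parent in messages:
        if parent is None:
            roots.append(mid)
        else:
            children[parent].append(mid)
    return roots, dict(children)

msgs = [("<b@host>", "<a@host>"), ("<a@host>", None), ("<c@host>", "<a@host>")]
roots, children = thread(msgs)
# roots == ["<a@host>"]; children["<a@host>"] == ["<b@host>", "<c@host>"]
```

This is essentially how the tape-archive restorations worked: merge whatever messages survive, then let the ids stitch the conversations back together.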

Doing so is definitely possible, particularly if they decide to cooperate with archival efforts.

No, that's the problem with URLs. This is not possible. The domain reddit.com belongs to a company and they control what gets shown when you access it. You can make your own reddit-archive.org, but that's not going to fix the millions of links that point to reddit.com and are now all 404.

All that said, if we limit ourselves to static documents, you still need to convince everyone to take part.

The software world operates in large part on Git, which already does most of this. What's missing there is some kind of DHT to automatically lookup content. It's also not an all or nothing, take the Fediverse, the idea of distributing content is already there, but the URLs are garbage, like:

https://beehaw.org/comment/291402

What's 291402? Why is the id 854874 when accessing the same post through feddit.de? Those are storage implementation details leaking out into the public. That really shouldn't happen; it should be a globally unique content hash or a UUID.

When you have a real content hash you can do fun stuff, in IPFS URLs for example:

https://ipfs.io/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

The /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf part is server independent, you can access the same document via:

https://dweb.link/ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

or even just view it on your local machine directly via the filesystem, without manually downloading:

$ acrobat /ipfs/QmR7GSQM93Cx5eAg6a6yRzNde1FQv7uL6X1o4k7zrJa3LX/ipfs.draft3.pdf

There are a whole lot of possibilities that open up when you have better names for content, having links on the Web that don't go 404 is just the start.

[–] soiling@beehaw.org 2 points 1 year ago (1 children)

re: static content

How does authentication factor into this? even if we exclude marketing/tracking bullshit, there is a very real concern on many sites about people seeing the data they're allowed to see. There are even legal requirements. If that data (such as health records) is statically held in a blockchain such that anyone can access it by its hash, privacy evaporates, doesn't it?

[–] Schrottkatze@kbin.social 3 points 1 year ago

A friend of mine talked about data preservation on the internet in a blog post, which I consider to be a good read. Sure, there's a lot lost, but as he says in the blog post, that's mostly gonna be trash content; the good stuff is generally comparatively well archived, as people care about it.

[–] aard@kyu.de 3 points 1 year ago

Another problem is that even if sites and their content stay up they often reorganize it for various reasons - often by importing old content into some new platform - and don't care about the URLs the content is available at. Which breaks all links to it.

Some pages at least try to show you a page with suggestions for what you might've been looking for, but I've seen those less and less over the years.

For my stuff I've been making sure to keep links working for over two decades now - on my personal page you can still access everything similarly to /cgi-bin/script.cgi?page even though that script and the cgi-bin directory as a whole have been gone for over a decade. But I seem to be pretty alone in my efforts to keep things at stable locations.
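Keeping decades-old URLs alive usually comes down to a small redirect table sitting in front of the current site. A minimal sketch of the lookup (the paths and table are illustrative, modeled on the /cgi-bin/script.cgi?page example above):

```python
# Hypothetical table mapping legacy query values from the long-gone CGI
# script to the stable locations that replaced them.
LEGACY = {
    "page": "/articles/page",
    "about": "/about",
}

def redirect_for(path, query):
    """Return the new location for an old /cgi-bin/script.cgi?<query> URL,
    or None if the request doesn't match a known legacy link."""
    if path == "/cgi-bin/script.cgi" and query in LEGACY:
        return LEGACY[query]
    return None
```

The same table can be expressed as web-server rewrite rules; the point is that a few dozen lines of mapping, maintained across redesigns, is all it takes to keep twenty years of inbound links from going 404.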

[–] VeeSilverball@kbin.social 3 points 1 year ago

I've had some thoughts on, essentially, doing more of what historically worked; a mix of "archival quality materials" and "incentives for enthusiasts". If we only focus on accumulating data like IA does, it is valuable, but we soak up a lot of spam in the process, and that creates some overwhelming costs.

The materials aspect generally means pushing for lower fidelity, uncomplicated formats, but this runs up against what I call the "terrarium problem": to preserve a precious rare flower exactly as is, you can't just take a picture, you have to package up the entire jungle. Like, we have emulators for old computing platforms, and they work, but someone has to maintain them, and if you wanted to write something new for those platforms, you are most likely dealing with a "rest of the software ecosystem" that is decades out of date. So I believe there's an element to that of encoding valuable information in such a way that it can be meaningful without requiring the jungle - e.g. viewing text outside of its original presentation. That tracks with humanity's oldest stories and how they contain some facts that survived generations of retellings.

The incentives part is tricky. I am crypto and NFT adjacent, and use this identity to participate in that unabashedly. But my view on what it's good for has shifted from the market framing towards examination of historical art markets, curation, and communal memory. Having a story be retold is our primary way of preserving it - and putting information on-chain (like, actually on-chain; the state of the art can secure a few megabytes) creates a long-term incentive for the chain to "retell its stories" as a way of justifying its valuation. It's the same reason why museums are more than "boring old stuff".

When you go to a museum you're experiencing a combination of incentives: the circumstances that built the collection, the business behind exhibiting it to the public, and the careers of the staff and curators. A blockchain's data is a huge collection - essentially a museum in the making, with the market element as a social construct that incentivizes preservation. So I believe archival is a thing blockchains could be very good at, given the right framing. If you like something and want it to stay around, that's a medium that will be happy to take payment to do so.

[–] jeena@jemmy.jeena.net 3 points 1 year ago

For my own stuff I try really hard to host it myself; the oldest surviving thing is from 2003 and it's still online: https://paradies.jeena.net/artikel/webdesign

[–] Hedup@lemm.ee 2 points 1 year ago (1 children)

I don't think it's a problem. If everything, or most, of the internet were somehow preserved, future anthropologists would have exponentially more material to go through, which would be impossible. Unless the number of anthropologists grows exponentially, similarly to how the internet does. But then there's a problem: if the number of anthropologists grows exponentially, it's because the overall human population grows exponentially. And if the human population grows exponentially, then the content it produces on the internet grows even faster.

You see, the content on the internet will always grow faster than the discipline of anthropology. And it's nothing new: think about all the lost "history" that was not preserved and that we don't know about. The good news is that the most important things will be preserved naturally.

[–] soiling@beehaw.org 11 points 1 year ago (2 children)

the most important things will be preserved naturally.

I believe this is a fallacy. Things get preserved haphazardly or randomly, and "importance" is relative anyway.

[–] fckgwrhqq2yxrkt@beehaw.org 2 points 1 year ago

In addition, who decides "importance"? Currently importance seems very tied to profitability, and knowledge is often not profitable.

[–] kool_newt@beehaw.org 2 points 1 year ago (1 children)

Capitalism has no interest in preservation except where it is profitable. Thinking about the long-term future or about future archaeologists' success, and acting on it, is not profitable.

[–] FuckFashMods@lib.lgbt 4 points 1 year ago (3 children)

It's not just capitalism lol

Preserving things costs money/resources/time. This happens in a lot of societies.

[–] Osayidan@social.vmdk.ca 2 points 1 year ago

To be realistic we need to pick and choose what to keep and expend effort/resources on those chosen things.

Without a technological breakthrough in data storage at some point there's got to be some kind of triage done. We all generate more information now than ever before, and this trend just keeps increasing. With things like A.I, XR, the metaverse or other similar concepts it'll also get exponentially more insane how much data we generate. It's not realistic at the moment, technologically or financially, to keep all of it in multiple geographically distributed copies, in a format that will last forever. For a lot of people or organizations it's not even feasible to keep one copy in some cases due to costs.

To do otherwise we would need a breakthrough that enables insanely cheap, infinitely scalable storage, that is immune to corruption (physical or digital) and optionally immutable to prevent modification. It would have to function in such a way that any reasonably advanced civilization can use the basic laws of physics to figure out how it works and consume the contents without any context of what the devices are. It would also have to work regardless of how fragmented it is, to use terms of today's technology if they only find one hard drive out of what used to be a pool of 100, it still needs to work on some level.
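The "one drive out of what used to be a pool of 100 still works on some level" property is what parity and erasure codes provide. A single-parity sketch, in the spirit of RAID (real systems use Reed-Solomon-style erasure codes across many more fragments, tolerating many losses):

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# With N data blocks plus one parity block (their XOR), any single
# missing block equals the XOR of all the surviving blocks.
data = [b"ancient ", b"internet", b" history"]
parity = xor_blocks(data)
recovered = xor_blocks([data[0], data[2], parity])  # block 1 was lost
# recovered == b"internet"
```

Erasure coding generalizes this so that, say, any 70 surviving fragments out of 100 can rebuild the whole; what no scheme yet gives us is the "readable by a civilization with no context" part.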

It's an interesting thought experiment and hopefully there's some ridiculously smart people working on it.

[–] m00njuic3@kbin.social 1 points 1 year ago

Thankfully we do have people trying to archive things. Sadly not everything will make it in; there's just too much new stuff all the time to keep up with. But hopefully we can keep the important and most important stuff.

[–] Brecat5@kbin.social 1 points 1 year ago

It sucks that we already have internet lost media
