daniel31x13 2 days ago | next |

> Research shows 25% of web pages posted between 2013 and 2023 have vanished.

I’ve personally been working on a project over the past year that addresses this exact issue: https://linkwarden.app

An open-source [1] bookmarking tool to collect, organize, and preserve content on the internet.

[1]: https://github.com/linkwarden/linkwarden

ryanpandya 18 hours ago | root | parent |

Is there a way it could eventually function like a P2P version of archive.org, so that if anyone has a copy of a page (at a point in time I suppose?), it's available to anyone in the network?

If I understand correctly, right now it's more of a self-hosted tool for personal archiving (which is great -- I'm a user myself), but something even more resilient, harnessing network effects, would be great to see.

geye1234 2 days ago | prev | next |

I've been trying to download various blogs on blogspot.com and wordpress.com, as well as a couple that now exist only on archive.org, using Linux CLI tools. I cannot make it work. Every attempt either misses the CSS, follows links to the wrong depth, stops arbitrarily, or has some other problem.

If I had a couple of days to devote to it entirely, I think I could make it work, but I've only been able to poke at it sporadically, though it's cost me a ton of time cumulatively. I've tried wget, httrack, and a couple of other more obscure tools -- all with various options and parameters, of course.

One issue is that blog content is duplicated across URLs -- you might get domainname.com/article/article.html, domainname.com/page/1, and domainname.com/2015/10/01, all of which contain the same links. Could there be some vicious circularity taking place, confusing the downloader about what it's done and what it has yet to do? I wouldn't have thought so, but static, non-blog pages are obviously much simpler than blogs.
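For concreteness, the kind of invocation I've been experimenting with looks roughly like this (the reject pattern is only a guess at the duplicate /page/N and date-archive views, and would need tuning per blog):

    wget --mirror --convert-links --adjust-extension \
         --page-requisites --no-parent --wait=1 \
         --reject-regex '/page/[0-9]+|/[0-9]{4}/[0-9]{2}/?$' \
         https://domainname.com/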

Anyway, is there a known, standardized way to download blogs? I haven't yet found one. But it seems such a common use case! Does anybody have any advice?

HocusLocus 2 days ago | prev | next |

I've been trying to extract historycommons.org from the Wayback Machine and it is an uphill battle, even just to grab the ~198 pages it says it collected. Even back in the days after 9/11, when the site rose to prominence, I was shuddering at its dynamically served implementation. Those were the days of Java, and sites loaded down the server side with CPU time when it should have been serving static items... from REAL directories. With REAL If-Modified-Since: support, file timestamps set from the combined database update times -- a practice that seems to have gone by the wayside on the Internet completely.

Everything everywhere is now Last-Modified today, now, just for YOU! Even if it hasn't changed. Doesn't that make you happy? Do you have a PROBLEM with that??
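(Concretely, this is the behaviour I miss -- a conditional request against a well-behaved static server; the URL and local file here are hypothetical:)

    # Re-fetch only if the page changed since our local copy's mtime
    curl -sI --time-cond ./archive/entry001.html \
         https://example.org/timeline/entry001.html
    # An honest static server answers: HTTP/1.1 304 Not Modified
    # A dynamic site that stamps Last-Modified with "now" answers 200 every time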

Everything unique at the site was in the query string after the ?, and there was more than one way to get 'there', 'there' being anywhere.

I suspect that many tried to whack the site and finally gave up. I got a near-successful whack once after lots of experimenting, but even then I said to myself, "This thing will go away, and it's sad".

That treasure is not reliably archived.

Suggestion: even if the whole site is spawned from a database, choose a view that presents everything once and only once. Present to the world a group of pages that completely divulge the content using slash separators only -- /x/y/z/xxx.(html|jpg|etc) -- with no duplicative tangents, IF the crawler ignores everything after the ? ... and place the actual static items in a hierarchy. The most satisfying crawl is one where you can do this, knowing that the archive will be complete and relevant and there is no need to 'attack' the server side with process-spawning.
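On the crawler's side, skipping the query-string variants is then trivial; a sketch, with the URL hypothetical:

    # Mirror only the canonical /x/y/z/xxx.html hierarchy,
    # refusing every URL that carries a ?query=string
    wget --mirror --no-parent --adjust-extension \
         --reject-regex '[?]' \
         https://example.org/timeline/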

alganet 2 days ago | prev |

One question seems obvious:

With AIs and stuff, are we saving humanity's digital history, or are we saving a swarm of potentially biased, auto-generated content published by the few who can afford large-scale deployment of LLMs?

jmclnx 2 days ago | root | parent |

Maybe this is true for recent pages, but in 2013, AFAIK, AI was not yet a thing.

Too bad the sites generated by AI are not tagged in some manner. But an argument could be made that these AI pages are reviewed by a person before they hit the web. One can hope, anyway :)

alganet 2 days ago | root | parent |

Let's consider SEO practices as an example. SEO spam very likely IS reviewed by a person before publishing.

It is still biased garbage anyway, because the reviewer only cares about its effect on ranking algorithms, not about the heritage of digital human history.