
There is nearly 10TB of YouTube metadata available on archive.org: https://archive.org/details/youtube-metadata


These show as unavailable/with lock icons for me. Is there some process to download locked content from IA?


redd-archiver uses Postgres full-text search. For static search you could use lunr.js.
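
A rough sketch of what the Postgres side can look like; the table and column names (comments, body_tsv) and the DSN are assumptions for illustration, not redd-archiver's actual schema:

    # Minimal Postgres full-text search query via psycopg2 (hypothetical schema).
    import psycopg2

    conn = psycopg2.connect("dbname=reddarchiver")  # assumed DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, ts_rank(body_tsv, query) AS rank
            FROM comments, plainto_tsquery('english', %s) AS query
            WHERE body_tsv @@ query
            ORDER BY rank DESC
            LIMIT 20
            """,
            ("self hosted archive",),
        )
        for row in cur.fetchall():
            print(row)

lunr.js, by contrast, builds its index ahead of time so it can be shipped as part of a fully static export.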


The torrent has data for the top 40,000 subs on Reddit. Thanks to Watchful1 splitting the data by subreddit, you can download only the subreddit you want from the torrent.


I am going to be honest: this looks really cool.

40,000 subs is a good number, and I hope coverage can be extended to even more subreddits.

Perhaps we can also migrate all or much of the data to Lemmy instances and finally get those instances up and running.

Thank you for creating this. It opens up a lot of interesting opportunities.


The data for 2025-12 has already been released; it is usually released every month and just needs to be split and reprocessed for 2025 by Watchful1. I will probably eventually add support for importing data from the monthly Arctic Shift dumps so that archives can be updated monthly (a rough import sketch follows the links below).

https://github.com/ArthurHeitmann/arctic_shift/releases

Arctic Shift https://academictorrents.com/browse.php?search=RaiderBDev

Watchful1 https://academictorrents.com/browse.php?search=Watchful1
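
If the monthly dumps are zstd-compressed NDJSON like the Pushshift-style dumps (an assumption here, and the filename below is hypothetical), an importer can stream them without unpacking to disk:

    # Stream a zstd-compressed NDJSON dump line by line.
    import io
    import json
    import zstandard as zstd

    def iter_dump(path):
        """Yield one JSON object per line of the dump."""
        with open(path, "rb") as fh:
            # Large dumps are often written with a big window; allow up to 2 GiB.
            dctx = zstd.ZstdDecompressor(max_window_size=2**31)
            reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
            for line in reader:
                if line.strip():
                    yield json.loads(line)

    for post in iter_dump("submissions_2025-12.zst"):  # hypothetical filename
        print(post.get("subreddit"), post.get("title"))
        break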


Is the data web-scraped? Is Reddit OK with that?


I included a metadata dump of every subreddit found in the torrent. It includes a status field that shows whether a subreddit is private, along with many more details. A small filtering example follows the links below.

data catalog readme: https://github.com/19-84/redd-archiver/blob/main/tools/READM...

reddit data: https://github.com/19-84/redd-archiver/blob/main/tools/subre...
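
For example, assuming the metadata dump is NDJSON and that status uses a value like "public" (both assumptions; the data catalog readme has the actual format and filename):

    # Keep only the subreddits whose status field marks them as public.
    import json

    with open("subreddit_metadata.ndjson", encoding="utf-8") as fh:  # hypothetical filename
        public = [
            rec for rec in (json.loads(line) for line in fh if line.strip())
            if rec.get("status") == "public"
        ]

    print(len(public), "public subreddits")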


I've created tooling for an instance registry and a team-based leaderboard. The API has functions to support this as well, so that we can collectively host archives in a decentralized and distributed manner.

registry readme: https://github.com/19-84/redd-archiver/blob/main/docs/REGIST...

register instances: https://github.com/19-84/redd-archiver/blob/main/.github/ISS...


Thank you for your comment. Some example dotfiles were not copied into my original repo; they have now been added.

https://github.com/19-84/redd-archiver/commit/0bb103952195ae...

The docs have been updated with the mkdir steps.

https://github.com/19-84/redd-archiver/commit/c3754ea3a0238f...


Cheers. I checked the updated steps.

This is still missing the step of creating the `output/.postgres-data` dir, without which docker compose refuses to start.

After creating that manually, going to http://localhost/ shows a 403 Forbidden page, which makes you believe that something might have gone wrong.

This is before running `reddarchiver-builder python reddarc.py` to generate the necessary DB from the input data.


I've updated the workflow and added a placeholder page that is served before any archives are created. Thanks again! https://github.com/19-84/redd-archiver/commit/0dfd505ca81cb2...


Thank you for your comment. I will support any platform that has a complete dataset available, and I will take submissions of complete datasets through GitHub issues. https://github.com/19-84/redd-archiver/blob/main/.github/ISS...


The API and MCP server are very powerful ;)


I have also published sub statistics and profiling for each platform. These can be used to help identify which subs to prioritize for archiving; a rough sketch of how follows the links below.

reddit: https://github.com/19-84/redd-archiver/blob/main/tools/subre...

voat: https://github.com/19-84/redd-archiver/blob/main/tools/subve...

ruqqus: https://github.com/19-84/redd-archiver/blob/main/tools/guild...
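
A rough illustration of using the stats for prioritization; the filename and the total_posts field are placeholders, and the real column names are in the files linked above:

    # Rank subs by activity to decide what to archive first.
    import json

    with open("subreddit_stats.ndjson", encoding="utf-8") as fh:  # hypothetical filename
        stats = [json.loads(line) for line in fh if line.strip()]

    for sub in sorted(stats, key=lambda s: s.get("total_posts", 0), reverse=True)[:25]:
        print(sub.get("name"), sub.get("total_posts"))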

