
79GB for all of the English articles minus the media. That's smaller than I would have guessed. You can fit this large slice of our culture on a $20.99 flash drive with 49GB left over. That seems like a good econo-cultural indicator: storage cost per Wikipedia. I wish I could short that index.
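
For anyone who wants to put a number on that "storage cost per Wikipedia" index, here is a rough sketch in Python. It assumes the drive is 128GB (implied by 79GB used plus 49GB left over, not stated outright) and that cost scales linearly with capacity; the function name is made up for illustration.

```python
def cost_per_wikipedia(drive_price_usd, drive_capacity_gb, dump_size_gb):
    """Share of the drive's price taken up by one copy of the dump.

    Hypothetical helper: assumes price scales linearly with capacity.
    """
    return drive_price_usd * dump_size_gb / drive_capacity_gb


dump_size_gb = 79                       # figure quoted in the comment
drive_capacity_gb = dump_size_gb + 49   # assumption: 79GB used + 49GB free
index = cost_per_wikipedia(20.99, drive_capacity_gb, dump_size_gb)

print(f"Drive capacity: {drive_capacity_gb}GB")
print(f"Storage cost per Wikipedia: ${index:.2f}")
```

With these figures the index comes out to roughly $13 per copy, and it drops whenever flash gets cheaper faster than the dump grows.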


When thinking about this sort of thing I always find it fun to think about information density perception. I could hand you a USB drive and it could either contain a significant chunk of the sum of human knowledge, taking you lifetimes to even skim through, or it could contain a 2.5 hour movie you'd think nothing of.

Multiple layers of things at work there of course but that's what makes it fun to think about.


>79GB for all of the English articles minus the media.

I think that's an error on the GitHub page: wikipedia_en_all_novid is all text plus pictures, just no videos. Text alone is ~15GB zipped. My 2014 media dump was ~76GB, so ~80GB for full text plus media checks out.


Does wikipedia_en_all_novid really include pictures? Wouldn't that be many hundreds of GB?


I think just the pictures which are embedded in pages, not all media assets.


Still, that seems way too small to me considering there are ~6m articles.
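
A back-of-the-envelope check of why that looks small, sketched in Python. It uses only the figures from this thread (79GB, ~6m articles) and assumes decimal gigabytes; the variable names are just for illustration.

```python
dump_size_gb = 79            # size quoted upthread
article_count = 6_000_000    # "~6m articles", approximate

# Assumption: 1GB = 10**9 bytes (decimal); binary GiB would give a slightly larger number.
bytes_per_article = dump_size_gb * 10**9 / article_count

print(f"Average budget per article: {bytes_per_article / 1000:.1f} kB")
```

That is about 13 kB per article on average for text and any embedded pictures combined, which only works if the images are few, small, or heavily compressed.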


Apparently I was wrong! They got it super small.

    wikipedia_en_all_mini_2019-09.zim    2019-09-18 03:16    10G
    wikipedia_en_all_nopic_2018-09.zim   2018-09-26 16:43    35G
    wikipedia_en_all_novid_2018-06.zim   2018-07-18 21:21    77G
    wikipedia_en_all_novid_2018-10.zim   2018-11-06 12:43    78G


From what I gathered, the ZIM picture library is recompressed to lower quality and size.


If "without media" means without images as well, it is larger than I expected.



