
Excuse my ignorance, but what does "duplicated" mean in this context? Multiple copies of the same file?


Multiple copies of a file, or parts of a file. These can be spatial (your archive contains an SVN checkout, so there's an extra copy of every file in the .svn/pristine directory) or temporal (you take regular backups, and many of the files haven't changed very much from one backup to the next).


Ah okay, thanks. Another question: Why wouldn't compression take care of that? Isn't the point of compression to compact as many repeating sequences as possible?


Deduplication is a form of compression, yes. Most forms of compression are "local" however -- looking to match data against bits from within the past few MB -- so they won't detect duplicated data spread across entire archives.
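To make the "local" point concrete, here is a small Python sketch using zlib. DEFLATE can only refer back about 32 KiB, which is a smaller window than the few MB some other compressors use, but the effect is the same. The function name and the sizes are made up for illustration.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Size of data after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(data, 9))

chunk = os.urandom(16 * 1024)    # random bytes: incompressible on their own
filler = os.urandom(64 * 1024)   # unrelated data, larger than the 32 KiB window

near = chunk + chunk             # second copy starts 16 KiB after the first
far = chunk + filler + chunk     # second copy starts 80 KiB after the first

print("near:", len(near), "->", compressed_size(near))
print("far: ", len(far), "->", compressed_size(far))
```

In the "near" case the second copy falls inside the window and costs almost nothing. In the "far" case the compressor has already forgotten the first copy, so the duplicate is stored again at full size.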


For compression to catch duplicates across files, the archive must be 'solid', meaning all the files are compressed as one continuous stream. The cost is that to add or remove something from the archive, you must reprocess the entire archive. This isn't a very good model for backups, especially when you have lots of them. (As an aside, .zip files aren't 'solid': each member is compressed independently, so two copies of a file won't compress against each other. This is also the reason most archives on Linux are made with tar, which creates a single stream of data that is then compressed.)
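A rough way to see the zip versus tar difference, using only the Python standard library. The file names and the 8 KiB payload are arbitrary choices for the demo, and the payload is kept small so the second copy stays within gzip's 32 KiB window.

```python
import io
import os
import tarfile
import zipfile

payload = os.urandom(8 * 1024)  # random, so it only "compresses" against its own copy
files = {"a.bin": payload, "copy_of_a.bin": payload}

def zip_size(files: dict) -> int:
    """Each zip member is deflated on its own, so the copy is stored in full."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())

def tar_gz_size(files: dict) -> int:
    """tar concatenates the members first; gzip then sees one solid stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())

print("zip:   ", zip_size(files))
print("tar.gz:", tar_gz_size(files))
```

The tar.gz should come out at roughly half the size of the zip, because only the solid stream lets the second copy be encoded as a back-reference to the first.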

Tarsnap uses variable-size blocks (chosen in such a way that inserting into the middle of a file creates minimal differences). If a new block is detected during a backup, only that block needs to be sent, as the rest are already stored on the server. It also means that new archives can refer to previously stored blocks, allowing each archive to be independent, and unused blocks are removed when the last archive using them is deleted. Before sending, blocks are compressed and then encrypted, so ordinary compression can't really help with duplicates across archives, since you might not have all the old data locally.
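Here is a toy Python sketch of the general idea: content-defined chunking plus a reference-counted block store. To be clear, this is not Tarsnap's actual algorithm or format. The gear-style rolling hash, the constants, and the BlockStore class are all assumptions made for illustration, and the compress-then-encrypt step is left out.

```python
import hashlib
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # fixed random value per byte
MASK = 0xFFF00000  # boundary when the top 12 bits are zero (~4 KiB average)
MIN_SIZE = 64      # avoid tiny chunks

def chunks(data: bytes) -> list:
    """Split data where a rolling hash of the last 32 bytes hits a pattern."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # older bytes shift out
        if (h & MASK) == 0 and i + 1 - start >= MIN_SIZE:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

class BlockStore:
    """Stands in for the server: blocks keyed by hash, with reference counts."""

    def __init__(self):
        self.blocks = {}    # digest -> chunk bytes
        self.refs = {}      # digest -> number of references from archives
        self.archives = {}  # archive name -> ordered list of digests

    def add_archive(self, name: str, data: bytes) -> int:
        """Store an archive; return how many blocks actually had to be sent."""
        digests, uploaded = [], 0
        for chunk in chunks(data):
            d = hashlib.sha256(chunk).hexdigest()
            if d not in self.blocks:
                self.blocks[d] = chunk
                uploaded += 1
            self.refs[d] = self.refs.get(d, 0) + 1
            digests.append(d)
        self.archives[name] = digests
        return uploaded

    def restore(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.archives[name])

    def delete_archive(self, name: str) -> None:
        """Drop an archive and free blocks that nothing else references."""
        for d in self.archives.pop(name):
            self.refs[d] -= 1
            if self.refs[d] == 0:
                del self.refs[d]
                del self.blocks[d]

data_rng = random.Random(1234)
base = bytes(data_rng.getrandbits(8) for _ in range(200_000))
edited = base[:100_000] + b"INSERTED" + base[100_000:]  # insert mid-file

store = BlockStore()
print("first backup sent: ", store.add_archive("monday", base), "blocks")
print("second backup sent:", store.add_archive("tuesday", edited), "blocks")
```

Because boundaries depend on the nearby bytes rather than on fixed offsets, the insertion only disturbs the block or two around it and everything after it lines up again. With fixed-size blocks, every block after the insertion point would shift and look new. Either archive can also be deleted without affecting the other, since each one is just a list of block references.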

You can also handle backups with a master (full) backup plus incremental diffs. This doesn't fit the Tarsnap model well, because each diff depends on the backups before it, so archives are no longer independent.



