WARC'in the Crawler (marginalia.nu)
80 points by Brajeshwar on Dec 22, 2023 | 17 comments


In the distant past, almost a decade ago, I was the lone engineer at Common Crawl. Common Crawl heavily leverages the WARC format.

My favorite capability of the WARC format borrows from the fact that most compression formats can be written to allow random access. Formats such as `gzip` and `zstandard` allow multiple compressed streams to be concatenated, and during decompression the result acts as if it were one contiguous file.

Hence you can compress multiple files and literally stick them together:

  $ echo cat > cat.txt
  $ echo dog > dog.txt
  $ zstd cat.txt dog.txt
  $ cat cat.txt.zst dog.txt.zst > catdog.zst
  $ zstdcat catdog.zst
  cat
  dog
For files in a textual / clearly delimited format, that means you can fairly trivially leap to a different offset, assuming each of the inputs is compressed individually. You lose some amount of compression, but random lookup is a fairly reasonable tradeoff. Common Crawl was able to use this to allow entirely random lookups into web crawl datasets dozens or hundreds of terabytes in size, without any change in file format, by utilizing Amazon S3's support for HTTP Range requests[1].
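A minimal in-memory sketch of that trick, using Python's stdlib `gzip` and hypothetical toy records (against S3, the byte slice would be an HTTP Range request instead):

```python
import gzip
import io

# Compress each record as its own gzip member, recording byte offsets
# as we write.
records = [b"cat\n", b"dog\n", b"fish\n"]
archive = io.BytesIO()
offsets = []
for rec in records:
    offsets.append(archive.tell())
    archive.write(gzip.compress(rec))

data = archive.getvalue()

# The concatenation still decompresses as one contiguous stream:
assert gzip.decompress(data) == b"cat\ndog\nfish\n"

def read_record(buf, index):
    """Random access: slice one member out by its offset and
    decompress only that member."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(buf)
    return gzip.decompress(buf[start:end])

print(read_record(data, 1))  # b'dog\n'
```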

Trading compression for random lookup is even more forgiving if you create a separate compression dictionary tailored toward your dataset. For web crawling, a dictionary would likely win back the majority of the compression gains. A website's shared template/s would yield very high compression across files if pages from the same site were written sequentially, which is what per-record compression gives up, but most crawlers don't operate sequentially, so the local compression gains you lose are likely small.
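A rough illustration of the dictionary idea using zlib's `zdict` parameter. The "dictionary" here is a hypothetical hand-written template fragment, standing in for a properly trained one (e.g. what `zstd --train` would produce from sample pages):

```python
import zlib

# Hypothetical shared boilerplate common to pages from one site,
# used as a compression dictionary.
template = b"<html><head><title></title></head><body><nav>home about</nav>"

pages = [
    b"<html><head><title>Cats</title></head><body><nav>home about</nav>cats",
    b"<html><head><title>Dogs</title></head><body><nav>home about</nav>dogs",
]

def compress(data, zdict=b""):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

def decompress(data, zdict=b""):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(data)

# Round-trips correctly when both sides share the dictionary:
assert decompress(compress(pages[0], template), template) == pages[0]

plain = sum(len(compress(p)) for p in pages)
shared = sum(len(compress(p, template)) for p in pages)
print(plain, shared)  # the dictionary-backed total is smaller
```

Each page still compresses independently (so random lookup survives), but the back-references into the shared dictionary recover much of what sequential whole-site compression would have given.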

[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...


Isn't this a benefit you'd trivially get just by using .zip? I pull individual files out of large .zip archives in S3 using HTTP range requests; works exactly as you'd expect. You know the zip header is at the end of the file, and the header tells you the offset and length of the compressed entry data so you can request the range. Two requests if you've never seen the .zip before, one if you've got the zip header cached.


As mentioned, it's trivial across the spread of compression algorithms supporting this type of behaviour (`gzip`, `zstandard`, `zip`, ...), with the header in `zip` making it even more convenient, as you note!

WARC as a format essentially states that, unless you have good reason otherwise, "record at a time" compression is preferred[1]. The mixture of "technically possible" and "part of spec" is what makes it so useful - any generic WARC tool can support random access, there are explicit fields to index over (URL), and even non-conforming WARC files can easily be rewritten to add such a capability.
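To make that concrete, here's a hedged sketch of record-at-a-time compression: each record (a simplified, not spec-complete, WARC record) is written as its own gzip member while a URL-to-offset index is built alongside:

```python
import gzip
import io
import uuid

def warc_record(url, body):
    # Minimal sketch of a WARC record: version line, named fields, a
    # blank line, the data block, and two trailing CRLFs. Real records
    # also need WARC-Date, Content-Type, etc.
    headers = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode()
    return headers + body + b"\r\n\r\n"

# Record-at-a-time compression: one gzip member per record, with a
# URL -> byte offset index built as we write.
out = io.BytesIO()
index = {}
for url, body in [("http://example.com/cat", b"cat"),
                  ("http://example.com/dog", b"dog")]:
    index[url] = out.tell()
    out.write(gzip.compress(warc_record(url, body)))

# Any generic gzip tool can still read the whole file end to end, but
# with the index we can seek straight to one record:
out.seek(index["http://example.com/dog"])
record = gzip.GzipFile(fileobj=out).read()
assert record.startswith(b"WARC/1.1")
```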

[1]: https://iipc.github.io/warc-specifications/specifications/wa...


It occurs to me that you could stick a few bytes of header in the beginning of the ZIP file, to tell you the exact location of the header at the end of it, thus avoiding multiple lookups. It would even still be ZIP-compatible.


Definitely. I take an alternative but similar approach: since I control the zip files, I can guarantee that the header is always within the last N kilobytes of the zip file (configurable value of N). I spend a HEAD request to get the length of the zip file and then walk backwards by N kilobytes. You would request the few bytes at the beginning instead of using that request to get the file length.


How can you guarantee that it's always within N kilobytes, when N depends on the number of files in the zip?


If you're creating the zips in the first place, you can just check and see how big the headers are when you create them. If you happen to get N wrong, you can request another chunk, but obviously it's nice to avoid multiple requests to get the header. For my use case, the number of files is small and relatively consistent between zips so a generous value of 64KB ended up working great.
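That walk-backwards approach can be sketched as follows, assuming no zip64 extensions and no archive comment (the fixed End Of Central Directory layout comes from the zip spec; the demo archive is hypothetical):

```python
import struct

EOCD_SIG = b"PK\x05\x06"
TAIL_BYTES = 64 * 1024  # the "generous value of N" described above

def central_directory_location(tail):
    """Given the last TAIL_BYTES of a zip (e.g. fetched with a single
    HTTP Range request), locate the End Of Central Directory record
    and return the (offset, size) of the central directory."""
    pos = tail.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("EOCD not in tail; retry with a larger N")
    # EOCD layout after the signature: four 2-byte disk/count fields,
    # then central directory size (4 bytes) and offset (4 bytes).
    cd_size, cd_offset = struct.unpack_from("<II", tail, pos + 12)
    return cd_offset, cd_size

# Demo against a small in-memory archive:
import io, zipfile
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("cat.txt", "cat")
data = buf.getvalue()
offset, size = central_directory_location(data[-TAIL_BYTES:])
# With no archive comment, the 22-byte EOCD sits directly after the
# central directory, so the pieces add up to the file length:
assert offset + size + 22 == len(data)
```

With `offset` and `size` in hand, the second Range request fetches exactly the central directory, which in turn gives the offset and length of each compressed entry.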


If anyone's interested in web crawling technology, check out Heritrix [1]. It's been around since 2004, and while not the most performant, it has incorporated many responsible disciplines into its design and, as this article pointed out, the WARC format.

1. https://heritrix.readthedocs.io


Second that. Anyone interested in studying web crawler tech should definitely take a look at Heritrix. I used it extensively when it was still in 2.x. They got so many things right about writing well-behaved and fault-tolerant crawlers. Plus the code is very modular and extensible, if you know some Java. The other popular option then was Apache Nutch, but it had too much Hadoop baggage.


Hadoop is a bit of a nuisance in this general corner of Java. It's got a propensity for integrating deeply with cluster adjacent technology in a way that is very difficult to root out.

Kind of a pity since it has the effect of making things that could be very easy, such as reading and writing parquet files, much harder than it needs to be in Java.


Its performance shines at larger scales. It's designed for politeness to individual domains, but scales out well for very wide crawls of many domains. It's pretty much endlessly configurable, but not the easiest to learn.


I wish Apple would’ve open sourced the .webarchive format.

Nothing beats the user experience: Cmd-S in Safari and select "Web Archive". It's downloaded as a permanent copy, indexed by Spotlight, and accessible on all your devices.

I use it for collecting recipes around the web. However, I’m a bit concerned about data longevity. I’ve tried other more open formats (loved Singlefile) but none have the UX and support that this has. It’s so simple (as it should be).


> Nothing beats the user experience: Cmd-S in Safari and select “Web Archive”.

Most browsers support MHTML: IE5's format, which is based on the .eml format. I think Firefox and Safari are the only major browsers that don't.

There's also the Mozilla Archive Format, which never really got popular for some reason: https://www.amadzone.org/mozilla-archive-format/ .

> I wish Apple would’ve open sourced the .webarchive format.

I mean… it's not like it's a secret. It's just very Apple-specific: binary plist and some other stuff.


If you are working a lot with those parquet files it might be worth looking at Apache Arrow which is an in-memory/wire format for working with columnar data. It has a lot of good support for parquet from what I gather and is really focused on allowing efficient wrangling of data. Zero-copy and all that.

Note: no affiliation, just in a deep rabbit hole on data


  The WARC (Web ARChive) file format offers a convention for concatenating multiple
  resource records (data objects), each consisting of a set of simple text headers 
  and an arbitrary data block into one long file. The WARC format is an extension 
  of the ARC file format (ARC) that has
  traditionally been used to store “web crawls” as sequences of content blocks
  harvested from the World Wide Web. Each capture in an ARC file is preceded by a
  one-line header that very briefly describes the harvested content and its length.
  This is directly followed by the retrieval protocol response messages and content.
  The original ARC format file has been used by the Internet Archive (IA) since 1996
  for managing billions of objects, and by several national libraries.


Love reading updates from marginalia. Always good technical posts about DIY solutions, with a good balance between efficient and good enough.


The project is quite exciting! It gives me hope that people are working on new things to get us out of this period of Web stagnation.



