Observations on Web Site Data

How compressible is web data? For a recent project of mine...
  • 135 GB of web pages downloaded.
  • 15 GB of actual data extracted.
  • 0.6 GB compressed data archives.
So, about a 25:1 compression ratio for text data.

Derived from the above:
  • 15/135 ~= 11%
So, approximately 90% of the web is crud.
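
For anyone who wants to check the arithmetic, here is a quick sketch in Python. The variable names are my own invention for illustration; the only inputs are the three sizes quoted above, taken as gigabytes.

```python
# Sizes quoted in the post, in gigabytes (variable names are illustrative).
downloaded_gb = 135.0   # raw web pages downloaded
extracted_gb = 15.0     # actual data extracted from those pages
compressed_gb = 0.6     # size of the compressed archives

# Compression ratio of the extracted text.
compression_ratio = extracted_gb / compressed_gb

# Share of the download that was actual data, and the remainder.
useful_fraction = extracted_gb / downloaded_gb
crud_fraction = 1.0 - useful_fraction

print(f"compression ratio: {compression_ratio:.0f}:1")
print(f"useful data: {useful_fraction:.1%}")
print(f"crud: {crud_fraction:.1%}")
```

The exact figures come out to about 11% useful and 89% crud, which rounds to the 90% above.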

Sturgeon's Law - ask for it by name!

