Sometimes there is a great article, some useful information, or something else on a website that you want to preserve. Downloading the page in the browser with something like Save as ... doesn’t always work properly. But don’t worry, there is a solution for this.
How archiving works
Simply put, all that is needed is to download the page. But what exactly does this mean? It means archiving the following components:
- the HTML page itself
- images
- JavaScript
- media
- and so on …
Today numerous sites consist of content delivered from various domains. That could be a content delivery network like Akamai, a service provider for user tracking, or social media integration from e.g. Google or Facebook. Exactly this mashup makes it extremely difficult to download a website correctly.
To solve this challenge there is a tool called `wget` that runs on a Unix command line. Of course there are other tools that could do the same job, but no other tool is as basic and widely available. Furthermore, `wget` can be used for many, many other things besides downloading a website. In case you are interested in all the power provided by `wget`, simply have a look into the manual.
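If `wget` is already installed, the manual is only one command away (assuming a Unix shell with man pages available; `wget --help` works as a quick fallback):

```shell
# full option reference, installed together with wget on most systems
man wget

# shorter overview of the available options
wget --help
```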
Introduction to wget
The following section describes some basic operations supported by `wget`. In order to choose which operation is reasonable for your case, please have a look at it before running a website download powered by `wget`.
The section below gives some examples that might be helpful when archiving a website. Furthermore, it’s possible to run a kind of batch download that archives a full tree of pages starting from one website.
The table below lists some really useful options for `wget`:
Option | Description |
---|---|
`-k`, `--convert-links` | Convert links to point to the local files |
`-p`, `--page-requisites` | Download all files necessary for proper site display |
`-H`, `--span-hosts` | Load files from other domains, too |
`-r`, `--recursive` | Recursive download, by default `--level=5` |
`-l`, `--level=depth` | Maximum recursion depth |
For offline usage it is necessary to use at least `-k` and `-p`. The first option converts all links into local links: if a file is available within the download folder, all links to that file point to the downloaded copy. The second option fetches all the content that the website needs, for example JavaScript, images, or media files.
Running wget
After running `wget`, all the downloaded data is stored in one or more folders. The following section describes downloading the `wget` manual, which is located at the following URL: https://www.gnu.org/software/wget/manual/wget.html.
To start the download, simply run:
wget -kp https://www.gnu.org/software/wget/manual/wget.html
After that there will be a folder named `www.gnu.org`. You can find the manual in the subfolder `www.gnu.org/software/wget/manual`, in a file called `wget.html`.
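A quick way to verify the result on the command line (a minimal sketch, assuming the download was started from the current working directory):

```shell
# list the folder wget created for the manual page
ls www.gnu.org/software/wget/manual/
# wget.html should show up here; the page's other assets are stored somewhere below www.gnu.org/ as well
```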
The requested website is stored within that `wget.html` file. Furthermore, all the assets that are required for rendering the website are stored within that folder structure. Such assets could be a `.css` file containing stylesheets, JavaScript within a `.js` file, or simply images that are used within the page. In case you want to open that page later (of course you want this 😉), simply open the `www.gnu.org/software/wget/manual/wget.html` file with your browser and get a working copy of the desired content. In this case it’s an HTML representation of the `wget` manual.
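Opening the archive also works directly from the command line, for example with `xdg-open` on a Linux desktop (any browser pointed at the file does the job just as well):

```shell
# open the archived manual with the default browser
xdg-open www.gnu.org/software/wget/manual/wget.html
```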
Examples
This section describes a number of typical scenarios.
Archiving a website
In case you want to download a single page only:
wget -k -p http://www.example.com/page.html
This only downloads the page itself and all the content that’s necessary for proper offline display of the site. In case the website requires data located on other domains, this might stop the archived copy from working.
Archiving a multi domain website
If some content is stored on other domains too, this option is necessary: simply add the `-H` parameter in order to also download content that is located on other domains. Modern websites often consist of multi-domain content, so this is the safest solution for archiving a website. Very often content like fonts, JavaScript, or video is provided by completely different domains: videos may be provided by YouTube, fonts by the Google APIs, or JavaScript by the website of the framework itself.
wget -H -k -p http://www.example.com/page.html
The disadvantage of using this option might be the amount of data that gets downloaded. This results in an extended download time as well as an increased amount of disk space required by the download.
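One way to keep the amount of data under control, not covered in the table above but part of standard `wget`, is the `-D`/`--domains` option, which limits host spanning to a whitelist of domains. A minimal sketch, where the listed font domains are only placeholders for whatever the page actually pulls in:

```shell
# span hosts, but only to the whitelisted domains (placeholder domain list)
wget -H -D example.com,fonts.googleapis.com,fonts.gstatic.com -k -p http://www.example.com/page.html
```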
Downloading numerous websites
Batch downloading a website and all linked pages works like this: the website is downloaded like in the example above, but the same is also done for all the content it links to, up to the first order. For example, if there is a link to http://www.example.com/second.html, this page will be archived too. When aiming to archive a complete manual that consists of numerous pages, this might be the preferred solution:
wget -r --level=1 -H -k -p http://www.example.com/page.html
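Applied to the `wget` manual from the Running wget section above, such a batch download could look like this (same options, just with the manual’s URL):

```shell
# archive the wget manual plus every page it links to directly (recursion depth 1)
wget -r --level=1 -H -k -p https://www.gnu.org/software/wget/manual/wget.html
```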
Conclusion
It’s quite simple to archive a website. The mashup approach used for composing modern websites distributes their content over numerous domains, but there is a solution for that too. `wget` is a powerful tool that makes it possible to fully download a single website as well as a whole tree of connected web pages, which is convenient for archiving a lot of pages that are linked from one site. Furthermore, there is no need for a special tool to read the website archives.