These classnotes are deprecated. As of 2005, I no longer teach the classes. Notes will remain online for legacy purposes.

UNIX01/Wget And Curl


wget

GNU Wget is a utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Wget is non-interactive, meaning that it can work in the background while the user is not logged on. This allows you to start a retrieval and disconnect from the system, letting Wget finish the work. By contrast, most Web browsers require the user's constant presence, which can be a great hindrance when transferring a lot of data.
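
For example, wget's -b option sends it to the background immediately, so you can start a large download (the URL here is hypothetical) and log off:

 $ wget -b http://www.foo.com/temp/large-file.iso

Unless told otherwise, wget then writes its progress messages to a file named wget-log in the current directory.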

Wget can follow links in HTML pages and create local versions (mirrors) of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can also be instructed to convert the links in downloaded HTML files to point to the local files for offline viewing.

Wget has been designed for robustness over slow or unstable network connections; if a download fails due to a network problem, it will keep retrying until the whole file has been retrieved. If the server supports resuming (sometimes called "regetting"), wget will instruct the server to continue the download from where it left off.
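
For example, the -c (--continue) option asks wget to resume a partially downloaded file rather than start from scratch, provided the server supports it:

 $ wget -c http://www.foo.com/temp/nano.tar.gz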

The basic syntax for wget is

 wget [option]... [URL]...

If you simply want to download the file http://www.foo.com/temp/nano.tar.gz, you could grab it using wget thusly:

 $ wget http://www.foo.com/temp/nano.tar.gz

Wget supports many options and features, for which you should consult its man page. However, some particularly useful ones for various types of administrators are as follows:

  • --spider
This makes wget act as a robot spider, indexing its way through web pages and checking whether links or pages exist without actually downloading them. This is very useful if you are a web server administrator and wish to check your site for dead links. You can even use it to check your browser's bookmarks file:
   wget --spider --force-html -i bookmarks.html

  • -N
Turns on time-stamping, so that a file is only downloaded again if the remote copy is newer than the local one.
  • -r
This causes wget to grab web-pages recursively, allowing local mirrors to be made. It can be combined with the following other options:
    • -l
    -l specifies the maximum recursion depth. It defaults to 5 levels.
    • -k
    -k tells wget to convert the links after downloading. This will convert the paths originally specified to local ones. It is useful for making web-backups or even for fixing unwanted absolute links in a web-site.
    • -m
    Turns on options suitable for web-site mirroring (equivalent to "-r -N -l inf -nr"); see the example below this list.
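
As a quick sketch, a local, browsable mirror of the (hypothetical) site used earlier could be made by combining these options:

 $ wget -m -k http://www.foo.com/

Here -m handles the recursive, time-stamped retrieval, while -k rewrites the links in the downloaded pages so they still work when viewed from the local copy.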

Though wget is a useful command line utility, more often than not administrators find it is a very valuable scripting element. Scripts which monitor web-sites for defacement, periodically check for dead links, or even automatically back up web-sites are easily written and set to run at a given interval.
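
As an illustration, here is a minimal sketch of such a backup script (the URL and backup directory are hypothetical); run from cron, it would create a fresh, dated mirror of the site each time:

 #!/bin/sh
 # Minimal sketch of a nightly web-site backup using wget's mirror mode.
 # SITE and BACKUPDIR are hypothetical; adjust them for your own setup.
 SITE="http://www.foo.com/"
 BACKUPDIR="/var/backups/www/`date +%Y%m%d`"
 mkdir -p "$BACKUPDIR"
 cd "$BACKUPDIR" && wget -q -m -k "$SITE"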

curl

Curl is somewhat of a "spiritual successor" to wget. It is more featureful than plain wget, and is also written more generally. While wget is used to fetch files from HTTP, HTTPS, or FTP servers, curl can be used to get files from nearly any communications infrastructure (at the time of this writing, this includes HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP, or ordinary files). Curl provides a large number of options and features, such as upload abilities, authentication, proxies, Kerberos, HTTP PUT and POST, and cookie handling.

Curl is also a programming library (libcurl) that allows its features and functionality to be implemented in other applications. It has bindings in nearly every language you can think of (C, C++, Perl, Python, PHP, Java, Ruby, etc.). It also supports multiple networking protocols (including IPv6) and offers seamless cross-platform development across UNIX, Windows, Mac, OS/2, and just about any OS you'd want to use.

Of course, all of this functionality comes at a price: curl is much less straightforward to use than wget.

Curl's basic syntax is very similar to wget:

 curl [options] [URL...]

If you wanted to again download the same file as above, the curl command would look very similar:

  $ curl http://www.foo.com/temp/nano.tar.gz
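
One difference worth noting: by default curl writes what it downloads to standard output rather than to a file. In practice you would usually add -O (capital letter O) to save the file under its remote name, or -o to choose a name yourself:

  $ curl -O http://www.foo.com/temp/nano.tar.gz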

However, curl has extra functionality with respect to its URL specifications. For example, you can specify multiple URLs or parts of URLs by writing part sets within braces, as in:

 http://site.{one,two,three}.com

or you can get sequences of alphanumeric series by using [] as in:

  ftp://ftp.numericals.com/file[1-100].txt
  ftp://ftp.numericals.com/file[001-100].txt    (with leading zeros)
  ftp://ftp.letters.com/file[a-z].txt

It is possible to specify up to 9 sets or series for a URL, but no nesting is supported at the moment:

 http://www.any.org/archive[1996-1999]/volume[1-4]part{a,b,c,index}.html
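
When globbing like this, curl can also substitute the matched part of each URL into the output file name using #1, #2, and so on (one for each set or series). For example, assuming the numbered files above actually exist on that server:

  $ curl -o "file_#1.txt" "ftp://ftp.numericals.com/file[1-100].txt"

Quoting the URL also keeps the shell from trying to interpret the brackets itself.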

You can specify any number of URLs on the command line. They will be fetched sequentially in the specified order.

Curl will attempt to re-use connections for multiple file transfers, so that getting many files from the same server will not require multiple connects and handshakes. This improves speed. Of course, this is only done for files specified on a single command line and cannot be used between separate curl invocations.
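
For example, the following (with hypothetical file names) fetches two files from the same server, re-using the connection where possible; note that -O is repeated once for each URL to be saved to disk:

  $ curl -O http://www.foo.com/temp/nano.tar.gz -O http://www.foo.com/temp/joe.tar.gz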

Again, curl has many options and features, including the several we listed above for wget. However, they are beyond the scope of this class. Since wget will work for all of the activities we have planned for this and future UNIX courses, we will not cover curl options.

Take a look at curl's man page for more information on it.

If you would like to see Curl's home page, check out: http://curl.haxx.se/


(C) Copyright 2003 Samuel Hart
This work is licensed under a Creative Commons License.