These classnotes are deprecated. As of 2005, I no longer teach the classes. The notes will remain online for legacy purposes.

UNIX03/Special Techniques For Web Servers Part 1


Build Separate Castles

For all but one-person operations, it is recommended that you use separate dedicated boxes for each of web pages, CGI, important databases, and e-mail. This prevents an intrusion on one of these services from affecting the others. CGI scripts are notorious for security problems, and if you are also running Sendmail (ignoring my advice to run Postfix), the combination can be very bad: unless they are run chroot'ed, CGI scripts have direct access to the Sendmail binaries on the system.

Do Not Trust CGIs

Many CGIs you will find on the internet are quick hacks written by people who are not knowledgeable about security. If you are running a site with extreme security concerns, you may not want to run CGI at all.

  • Note: You can, technically, have your cake and eat it too with respect to CGI and dynamic web pages. Always remember that you have many underlying UNIX utilities at your fingertips, ready to lend a hand. For example, you can keep your CGI to a minimum by having it run at intervals and produce static page output that the server then serves directly. By doing this, you still have the versatility of a dynamic site, but with the security (and speed) of a static one; see the sketch below. (This is actually what [Slashdot] does, though for speed rather than security.)
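
As a rough illustration of that approach, here is a minimal sketch; the script name, output path, and data source are hypothetical and not from the notes. The idea is to run the generator from cron every few minutes and let Apache serve the resulting static file:

 #!/usr/bin/env python
 # rebuild_news.py -- hypothetical example: regenerate a static HTML page
 # at intervals instead of running CGI on every request.
 # Run from cron, e.g.:   */5 * * * *  /usr/local/bin/rebuild_news.py
 import os
 import time
 
 OUTPUT = "/var/www/htdocs/news.html"   # file Apache serves statically
 ITEMS = ["First item", "Second item"]  # stand-in for a real database query
 
 def build_page(items):
     rows = "\n".join("<li>%s</li>" % item for item in items)
     return ("<html><body><h1>News</h1><ul>\n%s\n</ul>"
             "<p>Generated %s</p></body></html>" % (rows, time.ctime()))
 
 def main():
     tmp = OUTPUT + ".tmp"
     with open(tmp, "w") as f:   # write to a temporary file first...
         f.write(build_page(ITEMS))
     os.rename(tmp, OUTPUT)      # ...then swap it into place atomically
 
 if __name__ == "__main__":
     main()

Writing to a temporary file and renaming it into place means visitors never see a half-written page.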

You should also note that security varies depending upon the server-side language, and some languages are better than others. We will look at these shortly.

Hidden Form Variables and Poisoned Cookies

Many e-commerce sites store merchandise information in hidden HTML form variables. Any halfway knowledgeable person can simply save the HTML file, edit it, and resubmit it to alter things like prices and weights (and thus shipping costs).
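
To make the problem concrete, here is a contrived checkout form (the field names, values, and script path are invented for the example). Nothing stops a user from saving this page, editing the hidden price, and submitting the modified copy:

 <form action="/cgi-bin/checkout.cgi" method="post">
   <input type="hidden" name="item_id" value="1234">
   <input type="hidden" name="price" value="19.95"> <!-- trivially editable -->
   <input type="submit" value="Buy now">
 </form>

The defence is to treat hidden fields as untrusted input: carry only the item ID in the form, and look the price and weight up on the server when the order is processed.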

A similar problem exists for sites that rely upon cookies to store such information. Even if the cookie contains encrypted or obfuscated data, a malicious user may be able to break it and modify the settings for their own nefarious purposes.
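
One common mitigation, shown here as a minimal sketch rather than anything from the original notes, is to sign cookie values with a server-side secret so tampering can at least be detected (the secret and the example value are made up):

 import hashlib
 import hmac
 
 SECRET = b"change-me-and-keep-me-server-side"   # hypothetical signing key
 
 def sign(value):
     """Return 'value|mac', suitable for storing in a cookie."""
     mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
     return "%s|%s" % (value, mac)
 
 def verify(cookie):
     """Return the original value if the MAC checks out, otherwise None."""
     try:
         value, mac = cookie.rsplit("|", 1)
     except ValueError:
         return None
     expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
     return value if hmac.compare_digest(mac, expected) else None
 
 # Example: a tampered cookie fails verification.
 cookie = sign("cart_total=19.95")
 assert verify(cookie) == "cart_total=19.95"
 assert verify(cookie.replace("19.95", "0.01")) is None

Even a signed cookie can still be read and replayed by the client, so sensitive values such as prices are still better looked up server-side.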

Robot Exclusion of Web Pages

A robot (aka web crawler or spider) is a program that crawls the web looking for web pages. It then parses each page and usually performs some sort of archiving (it could be for a search engine, a la Google, for a web-archive service such as http://www.archive.org/, or it could be a spam-bot harvesting e-mail addresses).

There are many reasons to prohibit robots from accessing portions of your site: your site may have a no-deep-linking policy (a policy I don't personally agree with), you may have dynamically generated pages in which robots can get caught in recursive loops (eating up bandwidth, possibly even DoS'ing the server), or you may simply want to fight spam-bots.

There are two standards for excluding robots from particular web pages. The first is to place a robots.txt file in the web root of your website.

The format of this file is known as the Robots Exclusion Protocol, a method that allows web site administrators to indicate to visiting robots which parts of the site should not be visited.

In a nutshell, when a robot visits a web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it finds this document, it analyses its contents for records like:

 User-agent: *
 Disallow: /

to see which documents it is allowed to retrieve (this particular record tells all robots to stay out of the entire site). A more realistic robots.txt might be the following:

 User-agent: *
 Disallow: /images/
 Disallow: /tmp/
 Disallow: /email/ # A trap (more later)

Another way of keeping robots away from a page is to use the ROBOTS meta-tag in the page's HTML:

 <META name="ROBOTS" content="NOINDEX, NOFOLLOW">

Now, this is all well and good, provided the robot actually obeys the robots.txt file or ROBOTS meta-tag. Good robots will; bad robots won't.

Because you know that bad robots (such as spam-bots, bots looking for exploits, etc.) will not be paying attention to these files, you can actually use this against them to create clever little traps for them.

Trapping bad-bots

The simplest method for combating bad bots is to allow the use of .htaccess files and then "guide" the bad bots into directories containing restrictive .htaccess files.

For example, in the above robots.txt example, we disallowed robots from entering the /email/ directory:

 Disallow: /email/

We did this deliberately, knowing that good robots would stay out of it and bad robots would not. If the bot is a spam-bot, its algorithms could be very interested in any link with the word "email" in it. If it is an exploit-bot, then pages containing e-mail addresses would be a prime place to locate user account names.
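
How do you "guide" a bot into /email/ in the first place? A common trick, sketched here (the markup is not from the original notes), is a link that human visitors never see or click, but that a harvester chasing anything with "email" in it will happily follow:

 <!-- Hidden trap link: no link text, so humans never click it,
      but address-harvesting bots follow the href anyway. -->
 <a href="/email/"></a>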

You can then place in that directory an .htaccess file that enumerates specific known bad bots or known bad addresses and denies them access. For example:

 SetEnvIfNoCase User-Agent "Indy Library" bad_bot
 SetEnvIfNoCase User-Agent "Internet Explore 5.x" bad_bot
 SetEnvIf Remote_Addr "195\.154\.174\.[0-9]+" bad_bot
 SetEnvIf Remote_Addr "211\.101\.[45]\.[0-9]+" bad_bot
 Order Allow,Deny
 Allow from all
 Deny from env=bad_bot

Something like this presents the bad bots with a "403 Forbidden" error. Most of these bots will then cease to look further into your site.

However, this is obviously not a full solution: you will undoubtedly encounter bot user-agents that are not known to you, as well as clever bots that ignore such errors. To battle these, you can use a script that endlessly feeds the bots bogus data.

For example, by feeding a spam-bot a huge list of bogus e-mail addresses, you waste the spam-bot owner's time and effort in collecting those addresses. You've ruined their harvest, and they will think twice about setting their bot on your site again.
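
A minimal sketch of such a "tarpit" script might look like the following. The script name and URL are hypothetical, it assumes the server is configured to execute CGI in the trap directory, and the bogus addresses use example.com so no real domain gets pestered:

 #!/usr/bin/env python
 # poison.py -- hypothetical CGI tarpit: feeds harvesters an endless
 # supply of bogus e-mail addresses plus a link back to itself.
 import random
 import string
 
 def fake_address():
     user = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
     host = "".join(random.choice(string.ascii_lowercase) for _ in range(6))
     return "%s@%s.example.com" % (user, host)
 
 print("Content-Type: text/html\n")
 print("<html><body><h1>Contact list</h1><ul>")
 for _ in range(50):
     addr = fake_address()
     print('<li><a href="mailto:%s">%s</a></li>' % (addr, addr))
 print("</ul>")
 # A self-link with a random query string keeps the bot chasing "new" pages.
 print('<p><a href="/email/poison.py?page=%d">more addresses</a></p>'
       % random.randint(1, 1000000))
 print("</body></html>")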

By feeding exploit-bots fake addresses, you could be leading them into a honeypot (which we will look at in the coming weeks), or into a trap that allows you to catch them attempting something nefarious.



(C) Copyright 2003 Samuel Hart
This work is licensed under a Creative Commons License.