linux.oldcrank.com:
Defending Against Email Harvesters,
Leechers, and Web Beacons
©2003 Loran T. Hughes
All Rights Reserved
Spammers loved my web site. I know this because my logs showed plenty of hits from "EmailSiphon," "EmailWolf," and other popular email harvesters. From the volume of spam in my inbox, I know they've successfully found the email address that I naively placed on my site. Worse yet, I found that my entire site had been downloaded by persons unknown and that I was unwittingly notifying spammers that my email address was valid simply by deleting the spam!
There is hope. With a little planning, your web site can be protected against harvesters and leechers. This article will outline how to keep these malicious users out of your web site, protect your published email address, yet allow full access to legitimate users. I can't take credit for much of this information (or code, for that matter), but I collected and modified numerous tips & code snippets to create a Swiss Army Knife for the sport of robot wrangling.
The following is mainly aimed at webmasters using Apache, PHP, and Javascript. The same concepts can be implemented with cgi scripts under any web server.
Know Your Enemy
Joe Spammer knows better than to buy CD-ROMs full of email addresses. After all, who knows how old that list is? No, Joe wisely invests his money in an email harvesting program. He can input keywords to target his captive audience, then set the beast loose. The harvester will use a major search engine (such as Google) to search for web pages (or newsgroups) matching his keyword(s), then sniff each "hit" for email addresses (or mailing addresses, phone numbers, fax numbers, etc.). Some harvesters have built-in email servers with the ability to search and send spam at the same time.
Joe may be a weekend warrior, barely smart enough to find the power switch on his computer - not smart enough to mask the signature User Agent of his harvester. However, Joe may be a professional spam meister, able to make his harvester look like it is just a harmless version of Internet Explorer. In fact, Joe may have the skills to get you to verify your valid email address without your knowledge. No matter how crafty Joe is, he can be defeated.
Robot Wrangling, Part I:
Good Bots, robots.txt, and the Meta Robots tag
Believe it or not, there are good robots out there. For example, Googlebot indexes web pages for the Google search engine. If you want your web site listed in Google, you need to let this bot crawl your site periodically.
A well-behaved bot will truthfully identify itself, look for the robots.txt file in the root directory of your website, and follow the rules it finds. It's always a good idea to set limits as shown in this example robots.txt:
User-agent: *
Disallow: /cgi-bin
Disallow: /css

User-agent: Googlebot-Image
Disallow: /images
The first section in this example instructs all bots ('all' as indicated by *) to index all directories except /cgi-bin and /css. The second section specifically denies /images to Googlebot-Image, but other bots may crawl /images since they are not specifically disallowed. Add as many rules as you desire, but be careful not to create conflicting rules.
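To see how a well-behaved bot might apply rules like these, here is a minimal JavaScript sketch. The function names parseRobots and isAllowed are hypothetical, and the sketch assumes the usual convention that a bot obeys the record naming it and otherwise falls back to the * record. It handles only User-agent and Disallow lines, so treat it as an illustration rather than a complete parser.

```javascript
// Hypothetical helpers for illustration; not a complete robots.txt parser.
function parseRobots(text) {
  const groups = [];          // each group: { agents: [...], disallow: [...] }
  let current = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*/, "").trim();   // drop comments
    if (line === "") continue;
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      if (value !== "") current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === "disallow" && current && value !== "") {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(groups, botName, path) {
  const name = botName.toLowerCase();
  // Prefer the group that names this bot; otherwise use the "*" group.
  let group = groups.find(g => g.agents.some(a => a !== "*" && name.includes(a)));
  if (!group) group = groups.find(g => g.agents.includes("*"));
  if (!group) return true;    // no applicable rules: everything is allowed
  return !group.disallow.some(prefix => path.startsWith(prefix));
}
```

Under this assumption, Googlebot-Image is kept out of /images but is not bound by the * record, so repeat any shared rules in each bot-specific section if you want them to apply there too.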
Good bots will also look for the meta robots tag in the head section of each web page. If a bot finds robots.txt, it applies those rules first, then the rules found in the meta robots tag. Meta tags are placed in the head section of the web page:
<meta name="robots" content="index,follow">
For a detailed discussion of the meta robots tag, see the HTML Author's Guide to the Robots META tag.
Robot Wrangling, Part II:
Bad Bots & .htaccess
Obviously, bad bots don't care about robots.txt or meta robots tags. Fortunately, most garden variety spammers tend to be a lazy lot and don't bother to spoof a legitimate browser User Agent in their software. You can block known User Agents in .htaccess (placed in the root directory of your web site):
SetEnvIfNoCase User-Agent "CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "Crescent" bad_bot
SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "Website\ eXtractor" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
The .htaccess entries shown above are only a few of the nasty things you might want to block. I've taken the liberty of compiling a list of download managers, file suckers, and email harvesters that I like to keep away from my site. View it here.
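If you would like to test a User Agent string against the block list before touching .htaccess, the matching can be approximated in JavaScript. This is only a sketch: isBadBot and badBotNames are hypothetical names, and each name is treated as literal text matched without regard to case, whereas Apache treats each SetEnvIfNoCase pattern as a regular expression.

```javascript
// Names taken from the .htaccess example; extend the list as needed.
const badBotNames = [
  "CherryPicker", "Crescent", "EmailCollector", "EmailSiphon",
  "EmailWolf", "ExtractorPro", "NICErsPRO", "Website eXtractor"
];

// Escape regex metacharacters so each name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const badBotPatterns = badBotNames.map(name => new RegExp(escapeRegExp(name), "i"));

// Returns true when the User Agent contains any blocked name, ignoring case.
function isBadBot(userAgent) {
  return badBotPatterns.some(pattern => pattern.test(userAgent || ""));
}
```

For example, isBadBot("emailsiphon/1.0") is true, while a browser string such as "Mozilla/4.0 (compatible; MSIE 6.0)" passes through.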
Note to ISPs: Why not save your hosting clients some heartache and block such things in httpd.conf? They'll get less spam and your email server will thank you.
Revenge of the Nerds:
Nuke 'em with PHP!
Now that we've wrangled the good bots and shooed away the amateur spam harvesters, it's time to get tough on the hardline spam meisters and web leechers. Since malicious software has the ability to spoof valid User Agents, we'll use php scripting to man our security gate.
The first thing we'll do is plug holes in the web site. That is, any directory that doesn't have an index.php file will now get one that redirects the visitor's browser to the main index page. This will prevent the casual snoop from looking at the contents of /cgi-bin, etc. The index.php for this job simply contains:
<?php
// Send the visitor to the main index page, then stop processing.
header("Location: http://www.yourwebsite.com/index.php");
exit;
?>
Note: You can also block directory viewing by placing an .htaccess file inside the directory you wish to protect. The .htaccess should simply contain this line:
IndexIgnore *
Real Time Blackhole
Now we'll create a real time blackhole for the malicious bots that snuck past our .htaccess file. For this exercise, we'll create two directories named /sandtrap and /ban-ip. Create an empty file named ban-ip.txt in directory /ban-ip. Make sure the permissions are set to 755 on /ban-ip and 777 on ban-ip.txt.
Place this index.php file inside /sandtrap:
<?php
// Append the visitor's IP address to the blackhole list.
$ip = $_SERVER['REMOTE_ADDR'] . "\n";
$banip = '/path/to/ban-ip/ban-ip.txt';
$fp = fopen($banip, "a");
if ($fp) {
    fputs($fp, $ip);
    fclose($fp);
}
?>
Anyone accessing /sandtrap/index.php will have his IP address written to the file ban-ip.txt. The trap is set, now for the bait. Place a hidden link someplace on your home index page that points to /sandtrap/index.php. Make sure that it appears in the html code as the first link on the page, but not so close to the top that it might show up in a web indexer's description of your web site (I put mine in a hidden <div> statement about three paragraphs into the page). Most importantly, do not place any readable email addresses on your home index page! The email harvester takes links in order, so the first link logs the IP to our real time blackhole list before it is able to access our remaining links.
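One side effect of simply appending to ban-ip.txt is that a bot hitting the trap repeatedly adds the same address over and over. The JavaScript sketch below shows the bookkeeping idea of adding an address only once. The function recordTrapHit is hypothetical and works on the file contents as a string, so reading and writing the file is left out; the addresses are reserved documentation addresses.

```javascript
// Returns the new contents of the ban list after a visit to the trap.
// banListText holds one IP address per line; ip is the visitor's address.
function recordTrapHit(banListText, ip) {
  const entries = banListText
    .split("\n")
    .map(line => line.trim())
    .filter(line => line !== "");
  if (!entries.includes(ip)) {
    entries.push(ip);           // only add addresses we have not seen yet
  }
  return entries.join("\n") + "\n";
}
```

Calling recordTrapHit("192.0.2.10\n", "192.0.2.10") leaves the list unchanged, so the file grows only when a new address walks into the trap.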
Before moving on, we need to take a couple of steps back to robots.txt and .htaccess. Remember, we don't want good bots to fall into our sandtrap, or malicious types to gain access to certain files. Place the following lines in robots.txt:
User-agent: *
Disallow: /sandtrap
Disallow: /ban-ip
In .htaccess, we can block anyone from viewing our world readable/writable ban-ip.txt file:
SetEnvIfNoCase Request_URI ban-ip\.txt ban
<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>
Pulling it all Together
You may want to force every visitor to access your web site through the 'front door.' That is, if someone comes to the site from an outside referrer, they must go through the "home" page before being allowed to navigate to child pages. However, you will want to make an exception for some good robots to index these child pages directly, as defined in the file "indexer.txt." Place this php code snippet at the very top of each child page in your site, above the <html> tag:
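<?php
// Allow visitors arriving from our own pages, plus the good robots listed in indexer.txt.
$engine = file('/path/to/indexer.txt');
$ref = getenv('HTTP_REFERER');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$home = "yourwebsite.com";
$browse = 0;
// The referrer is one of our own pages.
if ($ref && stristr($ref, $home))
{
    $browse = 1;
}
// The User Agent matches one of the listed indexers.
foreach ($engine as $indexer)
{
    $indexer = rtrim($indexer);
    if ($indexer != "" && stristr($ua, $indexer))
        $browse = 1;
}
// Everyone else is sent to the front door.
if ($browse == 0)
{
    header("Location: http://www.yourwebsite.com/index.php");
    exit;
}
?>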
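The front door decision can also be written as a small stand-alone function, which makes it easy to try out different referrers and User Agents. The JavaScript sketch below is a stricter variation, not a translation of the snippet above: it compares the referrer's host name rather than searching the whole referrer string, so an outside address that merely mentions your domain does not qualify. The name mayBrowse and its parameters are hypothetical.

```javascript
// referrer:  the Referer header value (may be empty or missing)
// userAgent: the User-Agent header value
// home:      your domain, e.g. "yourwebsite.com"
// indexers:  names of good robots, one per line of indexer.txt
function mayBrowse(referrer, userAgent, home, indexers) {
  let host = "";
  try {
    host = new URL(referrer).hostname.toLowerCase();
  } catch (e) {
    host = "";                  // missing or malformed referrer
  }
  // Visitors coming from our own pages may browse freely.
  if (host === home || host.endsWith("." + home)) {
    return true;
  }
  // Listed robots may index child pages directly.
  const ua = (userAgent || "").toLowerCase();
  return indexers.some(name => {
    const wanted = name.trim().toLowerCase();
    return wanted !== "" && ua.includes(wanted);
  });
}
```

Keep in mind that both the referrer and the User Agent are supplied by the visitor and can be forged, so this check only filters out the less determined.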
Our php security gate, which blocks anyone on the real time blackhole list and any User Agent that doesn't identify itself, goes at the top of your home index.php file:
<?php
$sandtrap = file('/path/to/ban-ip/ban-ip.txt');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];
$punish = 0;
// Block any visitor that sends no User Agent.
if ($ua == "")
{
    $punish = 1;
}
// Block any visitor whose IP address is on the blackhole list.
foreach ($sandtrap as $blockip)
{
    $blockip = rtrim($blockip);
    // Compare whole addresses so that 1.2.3.4 does not also block 1.2.3.40.
    if ($ip === $blockip)
        $punish = 1;
}
if ($punish == 1)
{
    echo "<html><head><title>Access Denied</title></head>
<body><p>The software you are using to access our website is not allowed.</p></body></html>";
    exit;
}
?>
Note that with a small modification, this code can also perform the function of .htaccess. That's it! Your site will now lock out anyone with less than honest intent. As long as you keep email addresses off the home index page (or obfuscate them), it will be extremely difficult for email harvesters to get to them. Since many of these bad bots will be coming from dial-up sources with dynamic IP addresses, it would be a good idea to periodically clear the ban-ip.txt file so legitimate users aren't locked out.
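When checking a visitor against the blackhole list, it matters whether addresses are compared whole or as substrings: a substring test would let a banned 192.0.2.4 lock out 192.0.2.40 as well. The JavaScript sketch below illustrates the whole-address comparison. The names isBanned and shouldBlock are hypothetical, and the list is passed in as a string rather than read from disk.

```javascript
// True when ip appears as a complete line in the ban list text.
function isBanned(banListText, ip) {
  return banListText
    .split("\n")
    .map(line => line.trim())
    .filter(line => line !== "")
    .includes(ip);
}

// The gate: block banned addresses and visitors with no User Agent.
function shouldBlock(banListText, ip, userAgent) {
  return !userAgent || isBanned(banListText, ip);
}
```

With a list containing only 192.0.2.4, isBanned returns true for that address and false for 192.0.2.40, even though the second address contains the first as text.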
If you don't want or need everyone going through your home page, simply place the php "security gate" at the top of each page and include a hidden link to the real time blackhole.
Last Line of Defense:
Obfuscate Contact Information
We need to slam the last door on spammers and make our email addresses, phone numbers, etc., unreadable to the harvester programs. The method I recommend is with Javascript, using the free Hiveware Enkoder Form. Due to the length of the resulting code, I place it in an external Javascript file and code in my HTML as:
<script type="text/javascript" src="email.js"></script>
Unfortunately, the Hiveware Enkoder does not create code that is compatible with the Netscape 4.x series of browsers. If you need this backward compatibility, you still can enjoy a somewhat less secure level of obfuscation by using the Javascript document.write function:
<!--
document.write('<a href="mailto:somebody@somewhere.com">Email Me!</a>');
//-->
Place the code in a separate file, such as email.js, then place the script in your HTML as shown for the Hiveware Enkoder generated Javascript.
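A harvester that fetches email.js itself would still find the complete address in the document.write example. One modest improvement, sketched here under the assumption that harvesters scan for text shaped like an address, is to assemble the address from pieces so it never appears whole in any file. The function buildMailtoLink is a hypothetical name, and this is still weaker than the Enkoder approach.

```javascript
// Builds the link from separate pieces so that neither "mailto:" nor the
// full address appears as literal text in the script file.
function buildMailtoLink(user, domain, label) {
  const at = String.fromCharCode(64);   // the "@" character
  const address = user + at + domain;
  return '<a href="' + 'mai' + 'lto:' + address + '">' + label + '</a>';
}

// In a browser, write the link into the page where the script is included.
if (typeof document !== "undefined") {
  document.write(buildMailtoLink("somebody", "somewhere.com", "Email Me!"));
}
```

The visitor's browser ends up with the same link as before; only the script source looks different.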
So far we've wrangled good bots through robots.txt and meta tags, blocked bad bots with .htaccess, stopped sneaky bots cold with creative php scripting, and foiled the persistent with Javascript. Now we need to make sure our email client is a safe environment to read our mail.
Client Side:
Stupid Spam Tricks
Spammers use a variety of tricks to make it difficult to filter unsolicited email. These methods range from base64 encoded HTML text to breaking up words with comment or fake tags:
Text is bro<-- Mary had a little lamb -->ken up with comments
This text is bro<fhduuaoe9e37>ken with a fake tag
These fake and comment tags won't be visible when you view your email, but make it nearly impossible to filter on common spam terminology.
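A filter can undo this trick by removing anything that looks like a tag or comment before searching for spam terms. The JavaScript sketch below shows the idea; stripMarkup is a hypothetical name, and the patterns are deliberately crude, removing everything between a < and the next >.

```javascript
// Removes HTML comments first, then anything else enclosed in < and >.
function stripMarkup(text) {
  return text
    .replace(/<!--[\s\S]*?-->/g, "")   // proper comments, even if they contain ">"
    .replace(/<[^>]*>/g, "");          // real tags, fake tags, and "<-- ... -->"
}
```

Run on the two examples above, the broken word is rejoined as "broken" and can be matched by an ordinary keyword filter.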
One of the most insidious tricks in the spammer's playbook is a method of verifying that you actually received and opened their spam. So-called "web beacons" are embedded in HTML email; when you open the mail, it accesses a script on the spammer's web site, verifying that your email address is valid. This ensures your inbox will be spammed relentlessly. Such coding will appear in the email source as this example:
<img src="http://www.somespamsite.com/script.php?youraddress@somewhere.com" border="0" width="0" height="0"></img>
The worst part is that if you simply highlight the email to delete, it will activate the script. The only way to defeat it is to turn off external image loading in your email client. Microsoft Outlook does not allow you to turn this "feature" off. Therefore, I recommend that you DO NOT USE OUTLOOK (for this and other reasons).
Netscape and Mozilla users can disable external image linking by selecting Edit | Preferences | Privacy & Security | Images. Check the box labelled "Do not load remote images in Mail & Newsgroup messages."
In Ximian Evolution, select Tools | Settings | Mail Preferences. Click the "HTML Mail" tab, then select either "Never load images off the net" or "Load images if sender is in addressbook."
Other email clients should be set up similarly to deactivate web beacons.
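If you are curious whether a message carries a beacon, you can inspect its source for image tags that point at a remote server. The JavaScript sketch below lists such addresses; findRemoteImages is a hypothetical name, and the pattern is a rough approximation that only recognizes http and https image sources.

```javascript
// Returns the remote URLs referenced by <img> tags in an HTML message source.
function findRemoteImages(htmlSource) {
  const pattern = /<img\b[^>]*?\bsrc\s*=\s*["']?(https?:\/\/[^"'\s>]+)/gi;
  const urls = [];
  let match;
  while ((match = pattern.exec(htmlSource)) !== null) {
    urls.push(match[1]);        // the captured address after src=
  }
  return urls;
}
```

A remote address that carries your email address or some other identifier after the question mark, as in the example earlier, is a strong hint that the image exists only to report back to the sender.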
Junk Filters
Since spammers are always adapting, setting up good junk filters in your email client can be challenging. As a start, filter for these terms in the message body:
<img
<table
<head
<!
<-
Note: Yahoo! Groups makes use of web beacons, so filtering "<img" will block messages coming from your group. The solution is to go into your Yahoo! Groups membership options and turn off HTML email delivery.
Some ISPs have email filtering at the server level that you may be able to configure. Filters can also be set in almost any email client. If you use Yahoo! Mail or Hotmail, filters may be configured through the options menu.
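If your mail setup lets you script its filters, the body check on the terms listed above amounts to a case-insensitive text search. Here is a minimal JavaScript sketch; the names junkTerms and matchesJunkTerm are hypothetical, and a real filter would likely weigh other signals as well.

```javascript
// The filter terms suggested in this section.
const junkTerms = ["<img", "<table", "<head", "<!", "<-"];

// Returns the terms found in the message body, ignoring case.
// An empty result means the message passes this filter.
function matchesJunkTerm(body) {
  const lower = body.toLowerCase();
  return junkTerms.filter(term => lower.includes(term));
}
```

Returning the matched terms rather than a plain yes or no makes it easier to see why a legitimate message was caught and to adjust the list.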
Reference Links:
Guestbooks: Protecting Email Addresses, by Rik Nilsson
Stopping Spambots: A Spambot Trap, by Neil Gunton
wpoison.php: The PHP port of wpoison, by Jason A. Borgmann
Questions or Comments?
Feel free to email me with your comments or questions. If you need a custom solution, I am available for consultation.





