Recently, on a forum I frequent, several people asked for help dealing with web scraping, referral spam and hotlinking (bandwidth theft).
There are a few solutions to each problem and some of the problems also share a common solution.
IMPORTANT NOTE: The solutions provided assume that either a) your site is hosted on a webserver running Apache (or compatible), or b) the webserver hosting your site allows PHP scripts.
In part 1 I’ll cover preventing hotlinking/bandwidth theft.
What exactly is it?
Hotlinking/bandwidth theft is when an image hosted on "Website A" is linked to directly by "Website B" using the <img> tag. The image appears on "Website B" without ever having to be stored there. Although images are the usual target of hotlinking, other types of media (for example sound and video files) are targets too.
Bandwidth means the amount of data that can be transferred from one point to another. Just as internet connections are limited in bandwidth, a website's connection is often limited too. The limitations can be:
- Monthly (or even daily/weekly) bandwidth allowance.
- A limit on the transfer rate. A typical ADSL or cable internet connection can transfer up to 1 Mbit/s; a web host has much higher data-transfer capabilities, but will (usually) limit the rate per domain/account.
The implications of hotlinking are:
- Theft of bandwidth by "Website B" that "Website A" is paying for
- Copyright infringement by "Website B" if "Website A" holds the copyright on the hotlinked material.
- Reduced performance for "Website A": the bandwidth taken by "Website B" leaves less available for people browsing "Website A", so its pages load more slowly.
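To get a feel for how quickly hotlinking adds up, the arithmetic can be sketched in PHP. The image size and view count below are made-up figures, and hotlink_transfer_gib() is a hypothetical helper written for this illustration, not part of any script in this article.

```php
<?php
// Rough estimate of the data transferred for one hotlinked image.
// Both inputs are illustrative assumptions, not measurements.
function hotlink_transfer_gib($image_size_kib, $views) {
    $bytes = $image_size_kib * 1024 * $views;
    return round($bytes / (1024 * 1024 * 1024), 2);
}

// A 150 KiB image shown 20,000 times on "Website B".
echo hotlink_transfer_gib(150, 20000) . " GiB\n";
```

With these assumed figures, a single image accounts for roughly 2.86 GiB of transfer that "Website A" pays for.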
How can it be stopped?
There are a few ways to stop hotlinking/bandwidth theft. The first way is to edit (or create, if it does not already exist) the .htaccess file. This is a file used to give instructions to a webserver when a website is accessed by a browser. Place the .htaccess file in the root folder of your website (for example, /home/websitea.com/.htaccess).
Apache’s mod_rewrite [1] will be used for checking the referrer (written as REFERER in the rules) and redirecting the request if it does not meet a certain condition. A referrer is the page where the request originates from.
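As a quick illustration of what the webserver sees, the referrer is simply a URL string sent by the browser. The sketch below uses PHP's parse_url() to pull the host name out of one. The function name referrer_host() and the example URL are assumptions made for illustration.

```php
<?php
// Returns the lower-cased host part of a referrer URL, or "" if there is none.
function referrer_host($referrer) {
    $host = parse_url($referrer, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : "";
}

// An <img> tag on this page would send the page's URL as the referrer.
echo referrer_host("http://www.websiteb.com/gallery.html") . "\n";
```

The host name is the part that the rewrite conditions in this article compare against your own domain.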
The following instructs the webserver to display an alternative image file whenever a website that is not your own domain (websitea.com) requests an image:
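<IfModule mod_rewrite.c>
RewriteEngine On
# The referrer is not websitea.com or one of its subdomains (http or https)...
RewriteCond %{HTTP_REFERER} !^https?://([^/]+\.)?websitea\.com/ [NC]
# ...and the referrer is not empty.
RewriteCond %{HTTP_REFERER} !^$
# Serve getlost.jpg in place of the requested image.
RewriteRule \.(jpg|jpeg|gif|png)$ getlost.jpg [L]
</IfModule>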
"getlost.jpg"
This can be just a simple image file containing "this site is stealing bandwidth", or even an advert for your own website.
In the conditions, ! stands for "not" and ^ anchors the pattern to the start of the string (i.e. "^chocolate" matches only text beginning with the word chocolate, whereas chocolate without the ^ prefix matches hotchocolate as well as chocolate and chocolatehot). The first condition is therefore true for any referrer that is not websitea.com or one of its subdomains. The second condition, !^$, is true only when the referrer is not empty, so requests with an empty referrer are allowed through (not all browsers send the referring address). Both conditions must be true for the rule to apply. [NC] stands for "no case", so the pattern match is not case sensitive. [L] marks this as the last rule, so the webserver stops processing rewrite rules and continues to serve the HTTP request.
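To check which referrers such a pair of conditions would block, the same logic can be mirrored in a few lines of PHP. This is only a sketch for experimenting with the pattern: would_block() and the sample URLs are hypothetical, and the pattern is a variant that accepts https as well as http.

```php
<?php
// Pattern modelled on the first RewriteCond, accepting http and https.
// The "i" modifier plays the role of [NC].
const OWN_SITE_PATTERN = '#^https?://([^/]+\.)?websitea\.com/#i';

// Returns true when both conditions hold, i.e. when the rewrite rule would fire.
function would_block($referrer) {
    $not_own_site = !preg_match(OWN_SITE_PATTERN, $referrer); // first RewriteCond
    $not_empty = $referrer !== "";                            // second RewriteCond (!^$)
    return $not_own_site && $not_empty;
}

$samples = array(
    "",
    "http://www.websitea.com/page.html",
    "http://websiteb.com/",
    "http://websitea.com.example.net/",
);
foreach ($samples as $referrer) {
    $label = ($referrer === "") ? "(empty)" : $referrer;
    echo $label . " => " . (would_block($referrer) ? "blocked" : "allowed") . "\n";
}
```

The last sample shows why the pattern ends with a slash: a host that merely starts with websitea.com is still treated as a foreign site.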
To allow other sites to hotlink, but ban a single site (websiteb.com), then the following should be used:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?websiteb\.com [NC]
RewriteRule \.(jpg|jpeg|gif|png)$ getlost.jpg [L]
</IfModule>
More than one domain can be blocked:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?websiteb\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?websiteb2\.com [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?websiteb3\.com [NC]
RewriteRule \.(jpg|jpeg|gif|png)$ getlost.jpg [L]
</IfModule>
As there is no ! before the pattern, each condition matches when the referrer is one of the listed sites, so only those sites are blocked.
I recommend that the alternative image is a single-pixel GIF or JPEG, because any larger image still uses your own bandwidth every time it is served. Another option is to forbid access altogether. This can be done by replacing the last line in the above examples with:
RewriteRule \.(jpg|jpeg|gif|png)$ - [F]
[F] tells the webserver that access to image files is forbidden. A short message will be sent to the person’s browser, similar to:
Forbidden You don't have permission to access <image> on this server.
However, no text will be seen in the browser because the image is linked through an <img> tag. The least bandwidth-intensive option is to upload an empty text file to your website, for example "empty.txt", and change the last line to:
RewriteRule \.(jpg|jpeg|gif|png)$ empty.txt [L]
Another alternative is to use a PHP script. I wrote one last year for handling images, and it provides similar functionality to the .htaccess rules. However, it can also be used to help hide exactly where the images are located.
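<?php
// Domains that are allowed to display images served by this script.
$valid_domains = array("websitea.com", "www.websitea.com");
$base_url = "http://www.websitea.com/";
$email_address = "webmaster@websitea.com";
$image_directory = "/home/websitea.com/public_html/abcd1234/";

// Returns true if the referrer starts with http(s)://<an allowed domain>/.
function is_referrer_ok($referrer, $valid_domains) {
    foreach ($valid_domains as $auth_referrer) {
        // eregi() was removed in PHP 7; preg_quote() also escapes the dots.
        $pattern = "#^https?://" . preg_quote($auth_referrer, "#") . "/#i";
        if (preg_match($pattern, $referrer)) {
            return true;
        }
    }
    return false;
}

// Not every browser sends a referrer, so fall back to an empty string.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : "";

if (isset($_GET['image'])) {
    $image = $_GET['image'];
    if (empty($referrer) || is_referrer_ok($referrer, $valid_domains)) {
        // Resolve the path and reject requests such as "../../file"
        // that would end up outside $image_directory.
        $root = realpath($image_directory);
        $image_path = realpath($image_directory . $image);
        $inside = $root !== false && $image_path !== false
            && strpos($image_path, $root . DIRECTORY_SEPARATOR) === 0;
        $image_info = $inside ? @getimagesize($image_path) : false;
        // Index 2 holds the image type (1 = GIF, 2 = JPEG, 3 = PNG).
        $type = $image_info ? $image_info[2] : 0;
        if ($type == IMAGETYPE_PNG) {
            $image_type = "png";
        } elseif ($type == IMAGETYPE_GIF) {
            $image_type = "gif";
        } elseif ($type == IMAGETYPE_JPEG) {
            $image_type = "jpeg";
        } else {
            header("HTTP/1.1 404 Not Found");
            exit;
        }
        header("Content-type: image/$image_type");
        @readfile($image_path);
    } else {
        // Unauthorised referrer: optionally send a warning email, then return 404.
        if (isset($email_address)) {
            $date = date('d-M-Y', time());
            $time = date('H:i:s', time());
            mail($email_address, "WARNING: Unauthorised Image Link Attempt!",
                "Unknown site $referrer tried to access $image on $date at $time\n",
                "From: Image Script <$email_address>");
        }
        header("HTTP/1.1 404 Not Found");
    }
} else {
    // No image requested: send the visitor to the home page.
    header("Location: $base_url");
    exit;
}
?>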
To configure the script, copy and paste it into a text editor and change the following variables:
$valid_domains
An array of the domain names that are allowed to display your images: the domain name(s) your website uses, plus any other sites you wish to allow.
$base_url
The website’s address (it should include http://).
$email_address
The email that bandwidth theft warnings should be sent to (this is optional).
$image_directory
The path to where the images are located on the server. It’s a good idea to store them in a randomly named directory (for example acyz66), so it’s harder for people to guess the actual location and link directly to images.
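If you would rather not invent a directory name by hand, one can be generated. The snippet below is a minimal sketch; random_directory_name() is a hypothetical helper, and it assumes PHP 7 or later for random_bytes().

```php
<?php
// Builds a random directory name made of hexadecimal characters.
// random_bytes() is available from PHP 7 onwards (an assumption about your host).
function random_directory_name($bytes = 4) {
    return bin2hex(random_bytes($bytes));
}

// Prints 8 hexadecimal characters, different on every run.
echo random_directory_name() . "\n";
```

A random name only makes the directory harder to guess; the referrer check in the script is still what decides whether an image is served.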
Once configured, save it in your website’s root as image.php and then link to images using the script, for example:
<img src="image.php?image=image.jpg">
This script can also be used in combination with .htaccess. Apache’s mod_rewrite can redirect hotlinked image requests through it:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} \.(jpg|jpeg|gif|png)$ [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^/]+\.)?websitea\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule (.*) /image.php?image=$1 [NC,L]
</IfModule>
The reason to do this is so that an email notification can be sent every time someone tries to hotlink an image.
In part 2 I will cover preventing web scraping and Referral Spam.
[1] http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html
I’m trying to make an image load a URL instead.
Unfortunately, it’s not possible if the image is being opened via the img tag, for example, as it’s designed just to show an image.