The date of the first publication: December 29, 2005
As far as achieving better SEO results is concerned, the reality these days is that knowing how to detect sites that use spammy SEO techniques is becoming more important than knowing where to place your keywords. With spammy sites growing in number at a tremendous rate and the search engines severely penalising websites for linking to such neighbourhoods, we simply can’t neglect the issue any longer.
Granted, nobody can ever be 100% sure the site s/he is reviewing is absolutely spam-free. There are a lot of different ways to spam the engines, and many of them are not obvious at all. But knowing how to detect the most obvious and primitive types of search engine spam can still protect us from exceeding the critical bad neighbourhood percentage and receiving a penalty.
So, let’s go step by step.
First, it is a good idea to check whether the site in question is banned from Google (which, of all search engines, is the best at detecting spam). A quick glance at the Google Toolbar, which, as we all know, has the Google PR indicator, will tell us whether the PR of the site is equal to zero. If it isn’t, the site is obviously not banned. If the site is PR0, we need to do more research. Entering site:www.insertthenamehere.com into Google will tell us a lot more. Let’s say the site is neatly indexed, each page has a description and none of them has the dreaded “Supplemental Result” addition under the description (meaning Google approves of the site and, so far, hasn’t detected anything dubious there). It also means the site is well structured and search engine friendly. PR0, in this case, would mean “new site”, and after the next Toolbar PR update we will see a nice 4/10 there. Maybe more.
But if Google shows: “Your search – site:www.insertthenamehere.com – did not match any documents”, then the site is, most likely, in trouble. It is also possible that it simply has no incoming links yet, and will never be listed in the engines at all unless at least some links appear over time. You can easily check this by running the link:www.domainname.com search in the MSN search engine. Of all engines, MSN gives the most accurate information on incoming links. If MSN shows a few thousand links but Google doesn’t seem to know anything about the site, it is a clear case of a banned site. To be doubly sure, do the same site: test using the Yahoo! search engine.
There are other, less obvious, cases, which we will discuss below. For now, another very simple test you can start with is to wait until the page is completely loaded, then press Ctrl+A to highlight the whole page. If the page contains any hidden text (i.e. text written in white on a white background, or blue on a blue background) it will immediately become visible. If that’s the case, you need go no further. The site is spamming, and even if it is not banned from any engine yet, it is a bad neighbourhood.
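If you prefer to automate the Ctrl+A test, a rough script can scan a page’s inline styles for text whose colour matches its background colour. This is only a sketch of the idea (a real check would also have to resolve external stylesheets and inherited colours); the function name and the simple regex are my own illustration, not a standard tool:

```python
import re

# Illustrative heuristic only: flag inline styles where the text colour
# equals the background colour (classic white-on-white hidden text).
STYLE_RE = re.compile(r'style="([^"]*)"', re.IGNORECASE)

def find_hidden_text_styles(html: str) -> list:
    """Return inline style strings whose 'color' matches 'background-color'."""
    suspicious = []
    for style in STYLE_RE.findall(html):
        props = {}
        for declaration in style.split(";"):
            if ":" in declaration:
                name, _, value = declaration.partition(":")
                props[name.strip().lower()] = value.strip().lower()
        if "color" in props and props.get("background-color") == props["color"]:
            suspicious.append(style)
    return suspicious
```

Running it over a saved page will list every inline style that hides its own text; an empty list, of course, only rules out this one primitive trick.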
You need to look at invisible images as well as invisible text. Single-pixel spacer.gif images are usually not spam, as webmasters traditionally use them to keep table cells from shrinking and for some other purposes. But if such an image has an “alt” attribute stuffed with keywords or, worse still, is linked to something, it is a good reason to stay away from this website. (Note though that invisible images are sometimes linked to visit counters, including some counters provided by the engines themselves, and are also used by conversion tracking scripts, e.g. for Google AdWords PPC campaigns, so be careful with your judgements.)
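The same idea can be scripted for suspicious spacer images. The sketch below uses Python’s built-in HTMLParser; the class name and the three-word alt threshold are arbitrary illustrations. It flags single-pixel images that carry a wordy alt attribute, while leaving ordinary empty-alt spacers alone:

```python
from html.parser import HTMLParser

class TinyImageAuditor(HTMLParser):
    """Flag 1x1 images whose alt text looks keyword-stuffed.

    The three-word threshold is an arbitrary illustration, not a rule.
    """
    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        tiny = a.get("width") == "1" and a.get("height") == "1"
        alt = a.get("alt") or ""
        # An empty-alt spacer is normal; a wordy alt on a 1x1 image is not.
        if tiny and len(alt.split()) >= 3:
            self.flagged.append(alt)

auditor = TinyImageAuditor()
auditor.feed('<img src="spacer.gif" width="1" height="1" alt="cheap loans best rates">'
             '<img src="spacer.gif" width="1" height="1" alt="">')
```

After feeding the page, `auditor.flagged` holds the suspect alt texts; remember the visit-counter caveat above before passing judgement.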
Spam that is visible
Not all spam is hidden. For example, link farms usually don’t hide links, but it doesn’t make them less spammy. Quite often, FFA (free for all) pages even have the infamous acronym in their URLs, as if their creators are proud of themselves (believe it or not, most spammers are really proud of their work). So an FFA page is easy to distinguish: it will be quite long, full of links and totally useless to a human visitor.
To research the backlinks of a site, use the MSN search engine and the link: command. Quite often, you will see that some sites have built their inbound links by submitting to link farms (it is very funny to view those identical sites, one after another, sometimes with a different background colour but still looking like twins and even having identical page names). Other sites obviously bought their incoming links and, to keep the prices low, agreed upon hidden links. If you are willing to dig really deep, you might want to check the whois information. If you see that all the domains linking to the original site belong to the same owner, just stay away.
Often, you will see that the same webmaster has created 40 or 50 websites and linked each page on each site to all the other sites (usually at the bottom). Sometimes, people who do so sincerely believe they are doing nothing wrong and are just trying to improve their web visibility, when in fact they are asking for a severe penalty from Google and Yahoo! You are much safer if you don’t link to such sites. Too many outbound links at the bottom of a site’s pages should immediately put you on the alert. It can only mean a heavily cross-linked network or, alternatively, a site blatantly selling text links for the sake of PageRank.
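As a quick proxy for the “too many outbound links” test, you can count how many distinct external hosts a page links to. A minimal sketch, assuming the page HTML is already downloaded (the class and function names are illustrative, and what counts as “too many” is your call):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the hosts of all off-site links found on a page."""
    def __init__(self, own_host: str):
        super().__init__()
        self.own_host = own_host
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).netloc
        if host and host != self.own_host:
            self.external.append(host)

def outbound_domains(html: str, own_host: str) -> set:
    """Distinct external hosts a page links to; dozens of them is a warning sign."""
    collector = LinkCollector(own_host)
    collector.feed(html)
    return set(collector.external)
```

Seeing the same handful of cross-linked domains on every page of every “different” site is exactly the pattern described above.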
Does the text look weird to you?
Another example of visible spam is keyword-stuffed copy. It is very easy to spot, because repetitive keywords look weird to the eye. If you look at the copy and feel there is something wrong with it, read it out loud, but be careful you don’t sprain your tongue!
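Keyword stuffing can even be caught with a crude statistic: the share of the copy taken up by its single most frequent word. On real English text you would first strip stop words (“the”, “and” and so on), or they will dominate the count; any threshold you pick is entirely a judgement call:

```python
import re
from collections import Counter

def top_word_ratio(text: str) -> float:
    """Share of all words taken up by the single most frequent word."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    _, count = Counter(words).most_common(1)[0]
    return count / len(words)
```

Stuffed copy along the lines of “cheap loans cheap loans cheap loans cheap” pushes the ratio well above a half, while varied natural prose stays far lower.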
Doorways and cloaking
These are harder to spot. If the site is small enough, you can again do the site: search and go through all pages, especially those marked as “Supplemental results”. Often, Google’s smart filters mark the doorways as “Supplementals” because they can detect something wrong with them.
The doorways are usually either meaningless or nearly identical to each other (only the main keyword changes when you move to the next page). The file names often repeat the targeted keyword. Some doorways will redirect you immediately to the home page of the site; others will just contain a link you should click. The doorways themselves will be linked surreptitiously, through an invisible image or something that looks like an element of the graphic layout. You can come across such a link quite by chance.
Of course, the bigger the site, the harder it will be to check for doorways and hidden links to them.
Cloaking is harder still to detect. You can compare the cached version of the page with the one you are actually seeing. If there is a considerable difference, chances are the page is cloaked. But it can also mean that the page has recently been changed, and not yet re-indexed by engines.
For Firefox users, it could be a good idea to set the “user-agent” string to “Googlebot”. Some funny sites can be found this way.
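Putting the last two ideas together, you can fetch the same URL twice, once with a browser-like User-Agent and once as Googlebot, and compare the two responses. The sketch below uses only the standard library; the 0.9 similarity threshold and the example URL are made-up illustrations, and a low score can also mean dynamic content rather than cloaking:

```python
import difflib
import urllib.request

def fetch_as(url: str, user_agent: str) -> str:
    """Fetch a page while presenting the given User-Agent string."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

def similarity(html_a: str, html_b: str) -> float:
    """Rough 0.0-1.0 similarity between two versions of a page."""
    return difflib.SequenceMatcher(None, html_a, html_b).ratio()

# Usage sketch (needs network access, so left commented out; URL is made up):
# browser_view = fetch_as("http://www.example.com/", "Mozilla/5.0")
# bot_view = fetch_as("http://www.example.com/",
#                     "Googlebot/2.1 (+http://www.google.com/bot.html)")
# if similarity(browser_view, bot_view) < 0.9:
#     print("The two views differ noticeably; possible cloaking.")
```

The same `similarity` function works just as well for comparing a live page against Google’s cached copy.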
To check the cached version of the page in Google, use the “cache:” advanced operator with the exact URL of the page. If you have the Toolbar installed, you can also right-click the page and select the “Cached Snapshot of Page” option in the popup menu.
Some sites will turn caching off through meta tags. Since this is usually done to hide the fact that the site is cloaked, just stay away from such sites if you wish to be doubly safe.
What else the engines can tell us
The same site: search in Google can give us a lot more information about the site we are checking. We just need to know how to look at it.
For example, the “Supplemental Result” mark can mean a lot. If the site is dynamic in nature, the Supplemental Results will appear now and again to mark auxiliary pages that don’t contain much useful information, or pages that can be accessed through different URLs (a situation very common with forums, for example). It can also mean that the page has been physically deleted from the web server by the site owner, but has not yet been dropped from Google’s database. But in many cases it also means one or another type of spam is involved. When all pages are reported as Supplemental Results, there is good reason to suspect the site is approaching a permanent ban.
Apart from Supplemental Results, you can often see so-called PIPs (Partially Indexed Pages) in Google’s SERPs. It means the page is listed by its URL only, without any description at all. If there are too many PIPs in the site:www.domainnamehere.com listing, it is a very bad sign. But before shouting “Spam!” you might want to check the robots.txt file. If the pages in question are disallowed from indexing, that explains the PIPs. Again, dynamic sites often show a lot of PIPs simply because of their dynamic nature.
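Checking robots.txt by hand is easy, but Python’s standard library can also parse it for you. A small sketch (the robots.txt body and URLs below are made up for illustration):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse a robots.txt body directly; normally you would point the parser at
# http://www.domainnamehere.com/robots.txt with set_url() and read().
rp.parse("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /print/
""".splitlines())

# Pages under a disallowed path may quite legitimately appear URL-only.
blocked = not rp.can_fetch("Googlebot", "http://www.domainnamehere.com/print/page1.html")
allowed = rp.can_fetch("Googlebot", "http://www.domainnamehere.com/articles/page1.html")
```

If the URL-only pages all fall under a Disallow rule, the PIPs are explained and there is no reason to suspect spam.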
To link or not to link?
The procedures described above look like a lot of work, and those of my readers who have managed to read the whole article are probably asking themselves, “Wouldn’t it be easier to just not link to any sites at all?” It certainly would. But while the sites that link to numerous bad neighbourhoods face the risk of a penalty, those that don’t link out at all can be regarded as poor quality resources, because they work against the very spirit of the Net.
The Internet is supposed to be interlinked, so that every surfer, including inexperienced newbies, can intuitively find all the information on a subject simply by following links. So, making your site a dead end is not the best solution, even though it is the easiest one.
I would still recommend that site owners link out, and do it freely and generously. Actually, once you get used to the procedure, it is not too much work to check a website for the most obvious spam techniques. Soon, your intuition will reach the point where you can “smell” spam: from the look and feel of a site you will immediately tell when one is likely to be spammy and another is most likely clean. I can’t explain this effect, but it exists.
Are you ready? Good. Now, let’s go and build the web. The White Hat Web.