searmyst.htm: How to search: search engines' vagaries, explained by fravia+

Search engines' vagaries
First elements of searchenginology - 1
(updated November 1998)

Well, ever wondered which of the search engines get more hits, which one indexes more pages, which SEs are more spammed (where the ratio between "noise" and relevant information is worse)? And did you ever wonder why there's so much opportunistic spam among the first links that most SEs return (especially when the query was too broad)? And did you ever think that each search engine has its own algos to refresh/accept/select/classify/list the results of your queries... and that these same algos can be (pretty easily) cracked... and that many commercial 'slave-catchers' are doing "professionally" exactly that, in order to scrape some money from your useless clicks (destroying at the same time the value of the search engine services for millions of users)? Good old fravia+ will now explain all this (at least in part), so that you may choose and use the SEs better... and survive on the web!

[Some data] [Spam for clicking] [Reversing search spiders] [Write your own search bots] [Fravia's tips for your own site]

Some data
Main SEs: indexed pages, web coverage and number of monthly visitors (source: fravia's scripts, results for mid July)

Search engine        Indexed pages   Web coverage   Monthly visitors
Altavista (AV)       150 Mil         42.9%           7 Mil
Hotbot (HB)          120 Mil         34.3%           5 Mil
Northernlight (NL)    85 Mil         24.3%           3 Mil
Excite (EX)           50 Mil         14.3%          15 Mil
Infoseek (IS)         33 Mil          9.4%          13 Mil
Lycos (LY)            31 Mil          8.9%          10 Mil
Webcrawler (WC)        3 Mil          0.9%           6 Mil
Yahoo (YA)           not             not            32 Mil
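
A quick cross-check of the table, as a minimal sketch: dividing each engine's indexed pages by its coverage gives the total web size that row implies. The figures are copied from the table; the function and variable names are made up for this illustration.

```python
# Figures copied from the table above:
# (indexed pages in millions, web coverage in percent).
ENGINES = {
    "Altavista": (150, 42.9),
    "Hotbot": (120, 34.3),
    "Northernlight": (85, 24.3),
    "Excite": (50, 14.3),
    "Infoseek": (33, 9.4),
    "Lycos": (31, 8.9),
    "Webcrawler": (3, 0.9),
}


def implied_web_size(indexed_mil, coverage_pct):
    """Total web size (millions of pages) implied by one table row."""
    return indexed_mil / (coverage_pct / 100.0)


def implied_sizes(engines=ENGINES):
    """Map each engine name to the web size its row implies."""
    return {name: implied_web_size(pages, pct)
            for name, (pages, pct) in engines.items()}


if __name__ == "__main__":
    for name, size in implied_sizes().items():
        print(f"{name:14s} implies a web of about {size:.0f} million pages")
```

Nearly every row points to a web of roughly 350 million pages (Webcrawler's heavily rounded 0.9% lands a little lower), so the coverage column appears to be computed against one single estimate of that size.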

What can we conclude from these data?

That librarians and researchers and anyone seeking hard-to-find information (like us) should by all means use Altavista or Inktomi (which powers Hotbot) or Northernlight.
That even Altavista, which indexes 150 million pages and crawls more than 10 million pages per day (like HB and LY... most of the time to refresh links or to add manually submitted sites, NOT to search uncharted waters!) indexes LESS THAN HALF of the web! This makes it very important that you learn the OTHER search techniques (combing, klebing, etc.) that you'll find on my how to search page!
That people don't choose search engines based on their real or supposed quality, but rather moronically follow the pre-chewed links that they find inside their browsers (AOL browser, Netscape search page, etcetera). So the number of monthly visitors is of NO use whatsoever for judging the quality of a search engine (actually the contrary, since the more visitors a SE has, the more incentive there is for the slave-catchers to put commercial spam inside).
This said, there are also other factors in play: 'link popularity', for instance, should play a role when indexing, yet only HB, LY and WC take account of it. Unfortunately not all search engines show the DATE of the fished links (AV, HB, IS, LY and NL do, EX and WC do not), which helps you 'guess' which info may be fresher and more up to date when you peruse your results. Text positioning also plays a role for many search engines, as you'll read below.

Spam for clicking
Spam, in the case of the search engines, is something different from the usual email spam that you already hate, yet it is opportunistic crap nevertheless. There are people all over the world who scrape money by luring lusers and zombies into clicking banners (after having sold their unhappy clients the lie that clicking means 'commercial opportunity').
Those commercial slave-catchers have quickly understood a couple of simple truths:

1) Lotta people use the Search engines (duh)
2) Each search engine uses more or less simple algorithms to index the links resulting from a query.
3) It is VERY important to be listed among the top 10 links reported by a search engine, because zombies and lusers, that dunnow nothing about searching, just go for them immediately instead of having a more thorough look at the results (or even, as I advise, instead of performing the same query using MORE THAN ONE search engine and evaluating the answers BEFORE running off sideways).
4) There's no need to try for quality: just create (for instance) a hundred pages that link to each other and that carry text and tricks INTENDED TO FOOL THE SEs' often poor 'quality' algorithms.

The consequence is that often enough you'll find as answers to your queries only 'commercial' pages, that HAVE BEEN DESIGNED in order to figure among the first positions, instead of the true knowledge sites you're looking for, that don't care for this 'positioning' crap. It's as simple as that: the search engines that you are using are not giving you what you expect, nor what they are supposed to; they are just giving you tons of commercial spam. And millions of users are wasting time, so that the commercial slave-catchers can scrape some money, as usual...
That's reason enough to cross the commercial SE spammers' plans, annoy them and eventually destroy their so artfully designed sites, and I'll teach you how to do it...

Reversing search spiders
Any real reverser can quickly fool the SE's algos (commercial spammers, that are pretty stupid individuals, actually do it all the time - for money).
Algos vary from search engine to search engine (of course, that's the reason the same query gives DIFFERENT results on each search engine, btw) and may be very simple or complex.

The simplest way to reverse the SE's algos is to perform searches on very commercial subjects (those where the spammers from the 'insert site consultancies' battle a lot) and have a look at the first 15-30 sites that you'll get as result of your query. Your reversing eye will quickly pick up the relevant patterns...
Let's take as an example Excite, in order to let you grasp the complexity (and at the same time the banality) of the issues at stake.
The following is taken from my own essay reversing search engines' bots:

Excite: if you try to submit more than twenty pages then its spider will begin to penalize you.
Excite: is undergoing an indexing change right now, and in the meantime seems to have TWO spiders: a "roaming spider", now only looking for homepages, whose pages have a time limit (they will be dropped after that) and a "Fresh Spider" which confirms page changes and submittals. It seems that on 21/7 the older algos took over and Excite started reindexing subpages (and not only index pages).
Excite: does NOT have all its stuff stored on one huge computer, of course, but on several. On any given search one computer may be down, or overloaded, therefore you may get different answers for the same search on Excite.
Excite: Since Excite has the habit to 'drop' pages, in order to spam Excite, most commercial slaves use (and need) two domains. They build say 10 doorway pages on domain A, and make sure that they have a page on domain B that links to each of those doorway pages.
Excite: Since Excite is heavily used by the AOL types you should use alt tags, to identify buttons, etc, not to keyword stuff.
Excite: Very sensitive to spam from comment tags (at the bottom of the spammer's page), with complete sentences structured around his 'targeted keyword phrase', the searchstring where he hopes to appear among the first ten when queried.
Excite: 'Relevant' links play a role in Excite's algos: pages that come up on top are largely composed of a list of links to other pages.
Excite's algos punish spammers that repeat keywords closer than 7 words apart
Excite's algos give special attention to the bottom of the page, believing that most spammers place extra emphasis on placing words early in the document (top of the page). Notice how Excite spammers do not put any copyright or other nonsense after having placed their last set of spamming keywords, which are always the last thing on the page.
Excite falls for links named after the spammers' keywords (especially if on the bottom middle of the page).
Excite falls for keywords put in the ALT tags of bullets. Excite is using IMO 4 different algorithms, the two major ones are easy to see: each cycle lasting approx two weeks. Excite reversers can actually observe the "switch" on Monday morning.
Excite: hidden text will get spammers banned from Excite, never to return again (they are very unforgiving).
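
The keyword-spacing rule in the list above (repeats "closer than 7 words apart" get punished) is easy to test on your own text. The following is only a sketch of the rule as described here, not Excite's actual code: the function names, the tokenizer and the exact reading of "7 words apart" (difference between word positions) are all assumptions.

```python
import re


def min_keyword_gap(text, keyword):
    """Smallest distance, in words, between two consecutive occurrences
    of `keyword` in `text`; None if the keyword appears fewer than twice."""
    # Assumption: words are runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    positions = [i for i, word in enumerate(words) if word == keyword.lower()]
    if len(positions) < 2:
        return None
    return min(b - a for a, b in zip(positions, positions[1:]))


def repeats_too_close(text, keyword, min_gap=7):
    """True if two occurrences of `keyword` sit closer than `min_gap` words.
    The default of 7 is the figure quoted for Excite above."""
    gap = min_keyword_gap(text, keyword)
    return gap is not None and gap < min_gap


if __name__ == "__main__":
    sample = "crack this and crack that"
    print(min_keyword_gap(sample, "crack"), repeats_too_close(sample, "crack"))
```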

The above snippets are taken from the 'Excite' chapter of my own (unpublished) 'reversing search engines bots' small booklet, yet every single search engine has its own idiosyncrasies.
Altavista, for instance, seems to be rotating at least two ranking methods several times during the course of a day, in order to keep spammers at bay. On Altavista, quite correctly, "root domains" get a relevancy boost, which frustrates spammers. But many other small things also seem to be in play. I believe, for instance, that font size increases keyword weight on some AV algos (because they assume that parts of text written in bigger fonts are more relevant for the page). Also, if you try to submit more than 60 pages for a given domain, AV kills it... so how do the spammers spam in this case? They use a unix shell and start a lynx -dump. That way they can submit 20,000 pages in 4 hours from only one server.
Altavista searches moreover use different algorithms depending on the PART of the database they fall in: the main AV page uses one algorithm, yet the small AV search panel in Micro$oft's Explorer 4 and up (yes, you'll have to lower yourself to use this puke browser too, if you want to fish algos on the search engines) uses a different algorithm at AV. You can tell that the M$IE searches are different, because they will have an "n200" in the referer string in your logs. AltaVista's algos are based on the oldest search engine, the one in use on the web of the 'older ones', and AV is THEREFORE still the best place to find detailed information, even if the submitters (or spammers) never intended to list it that way. AV bots still follow all links they find, and sometimes they go 'crazy' and chart NEW UNDISCOVERED TERRAIN, something that happens very seldom on most other search engines. I personally hope that AV will retain its roots, and crawl and list as it does now, even though the net is becoming more and more commercial, notwithstanding all our efforts.
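
If you want to fish those "n200" hits out of your own logs, something like the following might do. It is a rough sketch under loud assumptions: the log is taken to be in the common "combined" format (referer as the second-to-last quoted field), the function names are invented, and the sample referer URL in the demo is made up, since only the "n200" marker itself is given above.

```python
import re

# Assumption: "combined" log format, i.e. each line ends with
# "request" status bytes "referer" "user-agent".
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def referer_of(line):
    """Return the referer field of one log line, or None if it is missing."""
    fields = QUOTED_FIELD.findall(line)
    return fields[-2] if len(fields) >= 3 else None


def is_msie_panel_hit(line):
    """True if the referer points at Altavista and carries the 'n200' marker."""
    referer = referer_of(line)
    return (bool(referer)
            and "altavista" in referer.lower()
            and "n200" in referer.lower())


if __name__ == "__main__":
    # Hypothetical log line: the query string layout is invented for the demo.
    demo = ('1.2.3.4 - - [21/Jul/1998:10:00:00 +0000] '
            '"GET /index.htm HTTP/1.0" 200 1234 '
            '"http://www.altavista.com/cgi-bin/query?pg=n200&q=reversing" '
            '"Mozilla/4.0 (compatible; MSIE 4.01)"')
    print(is_msie_panel_hit(demo))
```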

If you are interested in this kind of stuff (hopefully not in order to spam on your own :-) you may want to reverse some algos by yourself and have a general look around at the various spammers' newsgroups (yes, they exchange their findings on the web: 'I got three clients into the top AV 10, but cannot get them in WC' and so on).
Anyway, one of the results of this awful "commercial oriented" activity (these assholes would sell their sisters for a couple of clicks) is that MOST OF THE TIME you can and must FORGET the first 10% of the links of any query result you'll get. Yes, you understood me right: all the search engines' anti-spam tricks notwithstanding (at the moment the most refined antispammer is probably Infoseek, since it was for ages the most spammed SE) the first results of any query will NOT be relevant, because of the spam (unless your query is very specific).
Well, if you don't believe me, just try it: the more 'broad' your search category, the more useless will be all links that have been reported in the first positions. You'll have more luck, probably, when you start after the first relevant 10%.

There are a couple of tricks you can use:
Jump the first results
Say, for instance, you have searched for three terms, and you get term1 100000, term2 2000, term3 40000.
Now you know that the 'first broad relevant' palette is 2000 (the term2 findings). 10% of that is 200, so at ten hits per page you may begin your search at 'page' 20 of your results; don't worry, you won't lose much.
Negate spammers
Have a look at the first page and see if there are several hits from the same spammer among the first 10. Say you see three hits from http://www.spammon.com: just add -spammon to the search string (which should still be inside the search window) and have a look at the first 10 hits you get NOW. Repeat as necessary.
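
Both tricks boil down to a couple of lines of arithmetic and string handling. A minimal sketch, assuming ten hits per result page (the original only says that 200 hits correspond to 'page' 20) and with function names invented for the illustration:

```python
from urllib.parse import urlparse


def first_page_to_read(term_hits, skip_percent=10, hits_per_page=10):
    """Result page to start reading at, after jumping the spammed top slice.

    term_hits maps each search term to its hit count; the smallest count
    is taken as the 'first broad relevant' palette.
    """
    palette = min(term_hits.values())
    skipped_hits = palette * skip_percent // 100
    return skipped_hits // hits_per_page


def negate_spammer(query, spam_url):
    """Append -name to the query, where name is the spammer's domain label
    (for example 'spammon' for http://www.spammon.com)."""
    host = urlparse(spam_url).hostname or spam_url
    labels = host.split(".")
    name = labels[-2] if len(labels) >= 2 else labels[0]
    term = "-" + name
    # Do not add the same exclusion twice when the trick is repeated.
    return query if term in query.split() else f"{query} {term}"


if __name__ == "__main__":
    print(first_page_to_read({"term1": 100000, "term2": 2000, "term3": 40000}))
    print(negate_spammer("software reversing", "http://www.spammon.com"))
```

With the hit counts from the example above, the first function gives page 20, and the second turns "software reversing" into "software reversing -spammon".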

Punish Spammers
An interesting idea is to 'punish' spammers' sites on your own, simply resubmitting them with a lot of hidden spamming text that YOU have added... The search engines' algos will exclude them, mostly without warning. :-)
Alternatively you can email the search engines with your favourite spammer target; just remember to keep it very factual and to the point. Also, maintain a professional level. No all-caps screams, no judgements about it. Just the facts: "On such-and-such search, such-and-such commercial domain is dominating the results, denying the searcher access to a variety of real results from which to choose."

Write your own search bots
Uff! It's a long way to come to terms with the search engines, isn't it? Well, there's ANOTHER way: write your own dedicated search bots. It's easier than you may think, and works MUCH better than being at the mercy of the (at times comical) algos of the main SEs.
Once you have learned how to write your own search bots (study perl, my son) you'll be able to incorporate these tricks inside your own probes... yes... as you now probably understand: public search engines may be used if you have nothing better, but your own search bots will be much better!

They will explore uncharted waters MUCH MORE than the commercial oriented search engines. Unfortunately there's nowadays an ugly trend towards 'sampling' and away from 'real indexing'. Of course the commercial slave-catchers could not care less about information, contents or knowledge: what they want is a 'cow-park' of millions of drooling lusers that keep clicking around the same bogus useless sites for eternity, so why should they index the real web when they can have the same money for much less work?
They will not follow algos you don't control;
They will never allow paying sites to emerge among the first positions (as many search engines do);
They will not be so easily spammed;
They will find exactly the info you need and follow the links you (and not some American commercial slave-catcher) will have decided!
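
To give an idea of how little is needed, here is a bare-bones bot, written in Python rather than perl purely for illustration. It is a sketch under stated assumptions, not a finished probe: it walks links breadth-first from a start page that you choose and reports the pages containing all of your keywords. The names (LinkParser, fetch, search_bot), the "all keywords must appear" rule and the max_pages limit are choices made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags and the text of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def fetch(url):
    """Default fetcher: download one page over the network."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def search_bot(start_url, keywords, fetch=fetch, max_pages=50):
    """Breadth-first crawl from start_url; return the URLs of fetched
    pages whose text contains every keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # dead link or unreadable page: just move on
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        text = " ".join(parser.text).lower()
        if all(k in text for k in wanted):
            hits.append(url)
        for href in parser.links:
            link = urldefrag(urljoin(url, href))[0]  # absolute, no #fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

The fetch parameter is there so that you can swap in your own page getter (a cache, a throttled downloader, a test stub). As written, the bot neither pauses between requests nor reads robots.txt, and which links it follows is entirely up to the code you add.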

Enjoy!

Fravia's tips for your own site

OK, so many have written me praying for some tips to improve the ranking of their own pages, that -even if I don't really agree- I am going to publish the following. I sincerely hope that most of my readers are just 'correct' people that need a boost for their clever reversing site, not commercial spammers that will misuse this info for money. You may also notice that I -purposely- DO NOT use these same tips in order to boost the position of my own site (nor do I advertise for it on usenet) since I DO NOT WANT TO BOOST MY SITE TOO MUCH. I believe in fact that it is not the quantity of people reading my site, but the QUALITY of those people that makes the real difference (as usual in all sectors of life). Anyway here we go, and let's just hope you won't use this to sell some pathetic crap on the web... btw, the following is strongly Altavista geared...

Use <h1>key words</h1> header first. Include only targeted key words - no stop words.
Use one short paragraph after the header with the key words first.
Repeat the key word phrase twice in the first 25 words of the page.
AV stems words so "crack cracking cracker" counts as three "crack".
Repeat the key word(s) no more than three times: software reversing software protections software cracking is ok.
The order of key words in a key word phrase is not important. "software reversing" or "reversing software" it's the same.
Alt tags are indexed.
Comment tags are indexed.
<a HREF="mailto:luser@aol.net">key words</a> and <a href>key words</a> are indexed.
meta tag key words: keep it short use comma-separated words. It depends how people will search you, though. Since the masses predominantly search in "lower case", and use simple one and two word phrases, without quotes surrounding the phrase, you can probably leave it, yet if you believe that most of the people searching for you will know the searching ABC, then your keyword phrases (in AV) should be in their own quotes eg. "reverse engineering" instead of
Keep key words near each other and near the beginning of the page.
Make the key words 3-5% of the total text which is visible by the browser. (View your page with the browser. Start in the upper left hand corner of the page and highlight all the text by dragging the mouse cursor to the end of the page. Copy/paste to an html editor or word processor and count
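
The counting in these tips can be left to a few lines of code instead of a word processor. A minimal sketch, assuming you have already copied the visible text of your page into a string; the function names are invented, and words are matched exactly, without the stemming that AV is said to apply:

```python
import re


def words_of(text):
    """Lower-cased words of a text (letters, digits and apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(visible_text, keywords):
    """Percentage of the visible words that are one of the key words."""
    words = words_of(visible_text)
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    return 100.0 * sum(word in wanted for word in words) / len(words)


def phrase_count_in_lead(visible_text, phrase, lead_words=25):
    """How many times the key word phrase occurs in the first 25 words."""
    lead = words_of(visible_text)[:lead_words]
    target = words_of(phrase)
    if not target:
        return 0
    n = len(target)
    return sum(lead[i:i + n] == target for i in range(len(lead) - n + 1))


if __name__ == "__main__":
    page = "software reversing tips: software reversing explained"
    print(keyword_density(page, ["software", "reversing"]))
    print(phrase_count_in_lead(page, "software reversing"))
```

A density between 3 and 5 and a lead count of 2 would match the figures given in the tips above.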

This said, "tip" sheets are not worth much, since the SEs' algos are continuously improving, and obvious spamming techniques will get you nowhere. Most "tip" sheets on SE positioning that you will find around the web can be compared to a "tip" sheet bought at a race track: they are usually wrong. Trust your instincts, study and reverse as much as you can, and experiment with different strategies...
