fravia's search engines' vagaries
Fravia's Nofrill
Web design
updated November 1998
Search engines' vagaries
First elements of searchenginology - 1
Well, ever wondered which of the search engines gets the most hits, which one
indexes the most pages, which SEs
are the most spammed (where the ratio of "noise" to relevant information is worst)?
And did you ever wonder why there's so much opportunistic spam among
the first links that most
SEs return (especially when the query was too broad)? And did you ever think that each
search engine has its own algos to refresh/accept/select/classify/list the results
of your queries... and that these same algos can be (pretty easily) cracked... and that
many commercial 'slave-catchers' are doing "professionally" exactly that,
in order to scrape some money
from your useless clicks (destroying at the same time the value of the
search engine services for millions of users)?
Good old fravia+ will now explain all this (at least in part), so that
you may choose and use
the SEs better... and survive on the web!
[Some data]
[Spam for clicking]
[Reversing search spiders]
[Write your own search bots]
[Fravia's tips for your own site]
Some data
Main SEs: indexed pages, web coverage and number of monthly visitors
(source: fravia's scripts, results for mid-July)

Search engine        Indexed pages   Web coverage   Monthly visitors
Altavista (AV)       150 Mil         42.9%           7 Mil
Hotbot (HB)          120 Mil         34.3%           5 Mil
Northernlight (NL)    85 Mil         24.3%           3 Mil
Excite (EX)           50 Mil         14.3%          15 Mil
Infoseek (IS)         33 Mil          9.4%          13 Mil
Lycos (LY)            31 Mil          8.9%          10 Mil
Webcrawler (WC)        3 Mil          0.9%           6 Mil
Yahoo (YA)           n/a             n/a            32 Mil
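
The coverage column in the table above looks like each engine's indexed pages divided by an estimate of the whole web. The figures are consistent with a total of about 350 million pages, but that total is only inferred from the numbers (150 Mil at 42.9%), it is not stated anywhere in the text. A minimal sketch under that assumption:

```python
# Assumption: the web is estimated at 350 million pages. This value is
# inferred from the table (150 Mil corresponds to 42.9%), not given in the text.
ESTIMATED_WEB_PAGES_MIL = 350

# Indexed pages in millions, copied from the table.
INDEXED_MIL = {
    "Altavista": 150,
    "Hotbot": 120,
    "Northernlight": 85,
    "Excite": 50,
    "Infoseek": 33,
    "Lycos": 31,
    "Webcrawler": 3,
}


def coverage_percent(indexed_mil, total_mil=ESTIMATED_WEB_PAGES_MIL):
    """Return the share of the estimated web that an engine indexes, in percent."""
    return 100.0 * indexed_mil / total_mil


if __name__ == "__main__":
    for name, pages in INDEXED_MIL.items():
        print(f"{name:14s} {coverage_percent(pages):5.1f}%")
```

Change ESTIMATED_WEB_PAGES_MIL if you trust another estimate of the size of the web; the ranking of the engines stays the same, only the percentages move.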
What can we conclude from these data?
- That librarians, researchers and anyone
seeking hard-to-find information (like us) should by all means use Altavista, Inktomi
(which powers Hotbot) or Northernlight.
- That even Altavista, which indexes 140 million pages and crawls more than
10 million
pages per day (like HB and LY... most of the time to refresh
links or to add manually submitted sites, NOT
to search uncharted waters!) indexes LESS THAN HALF of the web! This makes it very important
that you learn the OTHER search techniques (combing, klebing, etc.) that you'll find on my
how to search page!
- That people don't choose search engines based on their real or supposed quality, but
rather moronically follow the pre-chewed links that they find inside their
browsers (AOL browser,
Netscape search page, etcetera). So the number of monthly visitors is
of NO use whatsoever
for judging the quality of a search engine (actually the contrary, since
the more visitors an SE has, the more incentive there is for the slave-catchers to
put commercial spam inside).
- That said, there are also other factors in play: 'link popularity', for instance, should play a role
when indexing, yet only HB, LY and WC take account of it. Unfortunately
not all search engines show the DATE of the
fished links (AV, HB, IS, LY and NL do, EX and WC do not), which is useful in order to
'guess' which info may be fresher and more up to date when you peruse your results. Text positioning also
plays a role for many search engines, as you'll read below.
Spam for clicking
Spam is, in the case of the search engines, something different from the usual email spam that
you already hate, yet it is opportunistic crap nevertheless. There are people all over the world
who scrape money by luring lusers and zombies into clicking banners
(after having sold their
unhappy clients the lie that clicking means 'commercial opportunity').
Those commercial slave-catchers have quickly understood a couple of simple truths:
- 1) Lotta people use the search engines (duh)
- 2) Each search engine uses more or less simple
algorithms to index the links resulting from a query.
- 3) It is VERY important to be listed among the top 10 links reported by a search
engine, because zombies and lusers, who dunno nothing about searching, just go for them
immediately instead of having a more thorough look at the results (or even, as I advise, instead
of performing the same query using MORE THAN ONE search engine and evaluating the
answers BEFORE running off sideways).
- 4) There's no need to try for quality: just create (for instance) a hundred pages that
link to each other and that carry text and tricks INTENDED TO FOOL THE SEs' often poor 'quality' algos.
The consequence is that often enough you'll find as answers to your queries
only 'commercial' pages that HAVE BEEN DESIGNED to figure among
the first positions, instead of the true
knowledge sites you're looking for, which don't care for this 'positioning' crap.
It's as simple as that: the search engines that you are
using are not giving you what you expect, nor what they are supposed to; they are
just giving you tons of commercial spam. And millions of users are wasting time
so that the commercial slave-catchers can scrape some money, as usual...
That's a reason good enough to cross the commercial
SE spammers' plans, annoy them and eventually
destroy their so artfully designed sites, and I'll teach you how to do it...
Reversing search spiders
Any real reverser can quickly fool the SEs' algos (commercial spammers, who are pretty stupid
individuals, actually do it all the time - for money).
Algos vary from search engine to search engine (of course: that's
the reason the same query gives DIFFERENT results on each search engine, btw) and
may be very simple or complex.
The simplest way to reverse the SEs' algos is to perform searches on very commercial
subjects (those where the spammers from the 'insert site consultancies' battle a lot)
and have a look at the first 15-30 sites that you get as results of your query. Your
reverser's eye will quickly pick up the relevant patterns...
Let's take Excite as an example, in order to let
you grasp the complexity (and at the same time the banality) of the issues at stake.
The following is taken from my own essay
reversing search engines' bots:
- Excite: if you
try to submit more than twenty pages, its spider will begin to penalize you.
- Excite: is
undergoing an indexing change right now, and in the meantime seems to have
TWO spiders: a "roaming spider", now only looking for homepages, whose pages have a time limit
(they will be dropped after that), and a "Fresh Spider", which confirms page changes and
submittals. It seems that on 21/7 the older algos took over and Excite
started reindexing
subpages (and not only index pages).
- Excite: does NOT have all its stuff stored on one huge computer, of course,
but on several. During any one search
one computer may be down, or overloaded, therefore you may get different answers for the
same search on Excite.
- Excite: since Excite has the habit of 'dropping' pages,
most commercial slaves use (and need) two domains in order to spam. They build, say, 10 doorway pages on domain A,
and make sure that they have a page on domain B that links to each of those doorway pages.
- Excite: since Excite is heavily used by the AOL types, you should use alt tags
to identify buttons etc., not to stuff keywords.
- Excite: very sensitive to spam from comment tags (at the bottom of the spammer's page),
with complete sentences structured around his 'targeted keyword phrase', the search string where he
hopes to appear among the first ten when queried.
- Excite: 'relevant' links play a role in Excite's algos: pages that come up on
top are largely composed of a list of links to other pages.
- Excite's algos punish spammers that repeat keywords closer than 7 words apart.
- Excite's algos give special attention to the bottom of the page, believing
that most spammers place extra emphasis on placing words early in the document
(top of the page). Notice how Excite spammers do not put any copyright or other nonsense
after their last set of spamming keywords, which are always the last thing on the
page.
- Excite falls for links named after the spammers' keywords (especially if at the bottom middle of
the page).
- Excite falls for keywords put in the ALT tags of bullets.
- Excite is using IMO 4 different algorithms; the two major ones are
easy to see, each cycle lasting approx. two weeks. Excite reversers can actually
observe the "switch" on Monday morning.
- Excite: hidden text will get spammers banned from Excite,
never to return again (they are very unforgiving).
The above snippets are taken from the 'Excite' chapter of my
own (unpublished) 'reversing search engines bots' small booklet, yet every single
search engine has its own idiosyncrasies.
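
One of the Excite observations, keywords repeated closer than 7 words apart, is easy to test on a page of your own. The following is only a sketch: it assumes "apart" means the difference in word positions, and it has nothing to do with Excite's real code, which nobody outside has seen. The function names are made up for the illustration.

```python
import re


def min_keyword_gap(text, keyword):
    """Return the smallest distance, in words, between two consecutive
    occurrences of keyword, or None if it appears fewer than twice."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    positions = [i for i, w in enumerate(words) if w == keyword.lower()]
    if len(positions) < 2:
        return None
    return min(b - a for a, b in zip(positions, positions[1:]))


def too_dense(text, keyword, min_gap=7):
    """True if keyword is repeated closer than min_gap words apart.
    Assumption: the 7-word threshold is measured in word positions."""
    gap = min_keyword_gap(text, keyword)
    return gap is not None and gap < min_gap


if __name__ == "__main__":
    sample = "crack this and crack that"
    print(min_keyword_gap(sample, "crack"), too_dense(sample, "crack"))
```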
Altavista, for instance, seems to be
rotating at least two ranking methods several
times during the course of a day, in order to keep spammers at bay. On Altavista,
quite correctly, "root domains" get a
relevancy boost, which frustrates spammers. But many other small things also seem to be
in play. I believe, for instance, that font size increases keyword weight in some
AV algos (because
they assume that parts of the text written in bigger fonts are more relevant for the page).
Also, if you try to submit more than 60 pages for a given domain, AV kills it... how do the
spammers then spam in this case? They use a unix shell
and start a lynx -dump. That way they can submit
20,000 pages in 4 hours from only one server.
Altavista searches moreover use different algorithms depending on the
PART of the database they fall in: the main AV page uses one algorithm, yet the
small AV search panel in Micro$oft's Explorer 4 and up (yes, you'll have to lower yourself to
use this puke browser too, if you want to fish algos on the search engines) uses a
different algorithm at AV. You can tell that the M$IE searches are different, because
they will have an "n200" in the referer string in your logs.
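
If you want to pick those visits out of your own logs, something like the following may do. It is a sketch with two loud assumptions: that your server writes Apache-style "combined" log lines, and that a plain substring test for "n200" in an Altavista referer is enough (the text does not say where in the referer string the marker sits).

```python
import re

# Assumption: Apache-style "combined" log lines, where the referer is the
# first of the two quoted fields that follow the status code and byte count.
REFERER_RE = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)" "[^"]*"\s*$')


def extract_referer(log_line):
    """Return the referer field of a combined-format log line, or None."""
    match = REFERER_RE.search(log_line)
    return match.group(1) if match else None


def is_msie_panel_search(log_line, marker="n200"):
    """True if the visit came from an Altavista referer carrying the marker.
    Assumption: a substring match on the referer is good enough."""
    referer = extract_referer(log_line)
    if not referer:
        return False
    referer = referer.lower()
    return "altavista" in referer and marker in referer


def count_panel_searches(log_lines):
    """Count how many of the given log lines look like MSIE panel searches."""
    return sum(1 for line in log_lines if is_msie_panel_search(line))
```

Feed count_panel_searches the lines of your access log (for instance an open file object) and compare the pages these visitors land on with those reached from the main AV page.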
AltaVista's algos are based on the oldest search engine, the one in use on the
web of the 'older ones', and AV is THEREFORE still the best place to find
detailed information, even if the submitters (or spammers) never intended to
list it that way. AV bots still follow all links they find, and sometimes they
go 'crazy' and chart NEW UNDISCOVERED TERRAIN, something that happens very seldom
on most other search engines. I personally hope that AV will retain its roots,
and crawl and list as it does now, even though the net is becoming more and more
commercial, notwithstanding all our efforts.
If you are interested in this kind of stuff (hopefully not in order to spam on your own :-)
you may want to reverse some algos by yourself
and have a general look around the various spammers' newsgroups (yes, they exchange their
findings on the web: 'I got three clients into the top AV 10, but cannot get them in WC' and so on).
Anyway, one of the results of this awful "commercially
oriented" activity (these assholes would sell their sisters for a couple of clicks)
is that you can and must FORGET the first 10% of the links of any query result you get.
Yes, you understood me right: all the search engines' anti-spam tricks notwithstanding (at the moment the most
refined antispammer is probably Infoseek, since it was for ages the most spammed SE) the first results of any query will
NOT be relevant, because of the spam (unless your query is very specific).
If you don't believe me, just try it: the broader your search category, the more useless
all the links reported in the first positions will be. You'll probably have more luck
when you start after the first relevant 10%.
There are a couple of tricks you can use:
Jump the first results
Say you have searched for three terms, for instance, and you get term1 100000, term2 2000, term3 40000.
Now you know that the 'first broad relevant' palette is 2000 (the term2 findings). 10% of that is 200, so,
at ten links per page, you may begin reading at 'page' 20 of your results; don't worry, you won't lose much.
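
The arithmetic of this trick fits in a few lines. A minimal sketch, assuming ten links per result page (the text does not give a page size, but 200 links and 'page' 20 imply it); the function name is made up for the illustration:

```python
def pages_to_skip(term_counts, skip_percent=10, results_per_page=10):
    """Return how many result pages to skip before you start reading.

    term_counts: number of findings reported for each search term.
    Assumption: results_per_page=10, which is implied but not stated in the text.
    """
    smallest = min(term_counts)                  # the 'first broad relevant' palette
    skipped_links = smallest * skip_percent // 100
    return skipped_links // results_per_page


if __name__ == "__main__":
    print(pages_to_skip([100000, 2000, 40000]))
```

With the three counts of the example (100000, 2000, 40000) this gives 20 pages to skip.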
Negate spammers
Have a look at the first page and see if there are several hits from the same spammer among the
first 10. Say you see three hits from http://www.spammon.com: just
add -spammon to the search string (which should still be inside the search window) and
have a look at the first 10 hits you get NOW. Repeat as necessary.
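
If you collect the first ten result URLs yourself, the negating can be automated. This is a sketch under assumptions: the helper names are hypothetical, three hits from the same domain is taken as the trigger (as in the spammon example), and the domain label is simply the part before the last dot, which is wrong for hosts like something.co.uk.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_label(url):
    """Return the label before the top-level domain, e.g. 'spammon' for
    http://www.spammon.com (naive: does not handle hosts like x.co.uk)."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host


def negate_spammers(query, result_urls, min_hits=3):
    """Append -label to the query for every domain that shows up at least
    min_hits times among the first 10 results."""
    counts = Counter(domain_label(u) for u in result_urls[:10])
    exclusions = [f"-{name}" for name, n in counts.items() if name and n >= min_hits]
    return " ".join([query] + exclusions)


if __name__ == "__main__":
    urls = [
        "http://www.spammon.com/a.htm",
        "http://www.spammon.com/b.htm",
        "http://www.spammon.com/c.htm",
        "http://some.example.org/essay.htm",
    ]
    print(negate_spammers("software reversing", urls))
```

Run the new query, collect the first ten hits again and call negate_spammers once more, until the first page looks clean.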
Punish Spammers
An interesting idea is to
'punish' spammers' sites on your own, simply resubmitting them with
a lot of hidden spamming text that YOU have added... The search engines' algos
will exclude them, mostly without warning. :-)
Alternatively you can email the search engines about your favourite spammer target; just
remember to keep it very factual and to the point. Also, maintain a
professional tone. No all-caps screams, no judgements. Just the facts:
"On such-and-such search, such-and-such commercial domain is dominating the results,
denying the searcher access to a variety of real results from which to choose."
Write your own search bots
Uff! It's a long way to come to terms with the search engines, isn't it? Well, there's ANOTHER
way: write your own dedicated search bots. It's easier than you may think, and works MUCH
better than being at the mercy of the whims of the (at times comical) algos of the main SEs.
Once you have learned
how to write your own search bots (study perl, my son) you'll be able to incorporate these tricks inside
your own probes... yes... as you now probably understand: public
search engines may be used if you have nothing better, but your own search bots will be much better!
- They will explore uncharted waters MUCH MORE than the
commercially oriented search engines. Unfortunately there's nowadays
an ugly trend towards 'sampling' and away from 'real indexing'. Of course the commercial slave-catchers
could not care less about information, contents or knowledge:
what they want is a 'cow-park' of millions of drooling lusers
that keep clicking around the same bogus useless sites for eternity, so why should they
index the real web when they can have the same money for much less work?
- They will not follow algos you don't control;
- They will never allow paying sites to emerge among the first positions (as many search
engines do);
- They will not be so easily spammed;
- They will find exactly the info you need and follow the links you (and not some
American commercial slave-catcher) will have decided!
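
To give an idea of how little is needed, here is a bare-bones bot. It is written in Python, not in the perl recommended above, and it is only a sketch: it follows every link breadth-first from a start page and reports the pages that contain all your words. It does not read robots.txt, does not pause between requests and does no ranking at all; the names crawl, fetch and LinkParser are made up for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text (hypothetical helper, no robots.txt handling)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("latin-1", errors="replace")


def crawl(start_url, wanted, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url; return the URLs of the pages
    whose text contains every word in wanted."""
    wanted = [w.lower() for w in wanted]
    queue, seen, hits = deque([start_url]), {start_url}, []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            page = fetch_page(url)
        except Exception:
            continue  # dead link or unreadable page: skip it
        if all(w in page.lower() for w in wanted):
            hits.append(url)
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

The fetch_page parameter is there so that you can plug in your own downloader (one that waits between requests, for instance) or a fake one for testing. The decision about which links to follow sits in the last loop, and that is exactly the place where your own rules go.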
Fravia's tips for your own site
OK, so many have written to me begging for some tips to improve
the ranking of their own pages that - even if I don't really agree - I am
going to publish the following. I sincerely hope that most
of my readers are just 'correct' people who need a boost for their clever reversing site,
not commercial spammers who will misuse this info for
money. You may also notice that I - purposely - DO NOT use these same tips
to boost the position of my own site (nor do I advertise it on usenet), since
I DO NOT WANT TO BOOST MY SITE TOO MUCH. I believe in fact that it is not the
QUANTITY of people reading my site, but the QUALITY of those people that makes the real
difference (as usual in all sectors of life).
Anyway, here we go, and let's just hope you won't use this to sell some pathetic
crap on the web... btw, the following is strongly Altavista-geared...
Use <h1>key words</h1> header first. Include only targeted key words - no stop words.
Use one short paragraph after the header with the key words first.
Repeat the key word phrase twice in the first 25 words of the page.
AV stems words so "crack cracking cracker" counts as three "crack".
Repeat the key word(s) no more than three times:
"software reversing software protections software cracking" is ok.
The order of key words in a key word phrase is not important:
"software reversing" or "reversing software", it's the same.
Alt tags are indexed.
Comment tags are indexed.
<a HREF="mailto:luser@aol.net">key words</a>
and <a href>key words</a> are indexed.
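
A couple of the tips above can be checked mechanically before you submit a page: how often the key word shows up in the first 25 words, and how often overall. The sketch below is only an approximation. Real stemming is replaced by a crude prefix test (so "crack" also matches "cracking" and "cracker"), which is an assumption and certainly not AV's own stemmer, and the function names are made up.

```python
import re


def stem_matches(word, key):
    """Crude stand-in for stemming: a word counts if it starts with the key."""
    return word.startswith(key)


def keyword_report(text, key, head_words=25, max_repeats=3):
    """Count occurrences of key (prefix-matched) in the first head_words
    words and in the whole visible text of a page."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    key = key.lower()
    in_head = sum(1 for w in words[:head_words] if stem_matches(w, key))
    total = sum(1 for w in words if stem_matches(w, key))
    return {
        "in_first_words": in_head,
        "total": total,
        "within_limit": total <= max_repeats,
    }


if __name__ == "__main__":
    print(keyword_report("crack cracking cracker", "crack"))
```

For the example in the tips, "crack cracking cracker" counts as three occurrences of "crack", which is right at the three-repeats limit.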