
All About Web Search Engines

by the Gus

Web search engines maintain databases of web pages catalogued for text content. Sometimes the HTML content (sites referred to, font sizes and colors, etc.) is also catalogued in the databases. The databases themselves are large and complex data structures that can be searched rapidly for strings of words.

The databases are built by robots, autonomous programs that surf the web and catalogue the pages as they follow links. The databases also receive input from URLs submitted manually by page authors.

How a page is catalogued, and in what order it appears in the results returned by a search engine, depends on a combination of factors, namely:

META tags

META tags come in many flavours, but the two most important ones are in the form of

<META name="description" content="string"> where string is the description of a URL returned by a search engine that has catalogued the page.

and

<META name="keywords" content="string"> where string is a series of supplemental keywords that do not appear in the text but that should be taken into account when a page is added to a search engine database. A good use of this feature is to include other conjugations of verbs and other forms of nouns from the forms actually found in the document so people looking for, say "dinosaurs" find your document even though you only wrote of a "dinosaur."

META tags are placed between the <HEAD> and </HEAD> tags of an HTML document.
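Here is a minimal sketch of how the two tags might sit in the head of the dinosaur page mentioned above; the title, description, and keywords are made up purely for illustration:

  <HTML>
  <HEAD>
  <TITLE>Natural History of the Dinosaur</TITLE>
  <!-- description: what a search engine that catalogues this page returns as its summary -->
  <META name="description" content="A short natural history of the dinosaur, from the Triassic to the Cretaceous.">
  <!-- keywords: supplemental words, such as the plural "dinosaurs", that never appear in the body text -->
  <META name="keywords" content="dinosaurs, dinosaur, fossil, fossils, paleontology">
  </HEAD>
  <BODY>
  The body text, which only ever mentions a "dinosaur" in the singular, goes here.
  </BODY>
  </HTML>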

relevance rules

The relevance rules vary from search engine to search engine and can change in any given search engine from time to time. The relevance rules are closely guarded trade secrets and must be determined by careful experimentation when submitting URLs. This requires a lot of testing, and since the rules change, the testing must be ongoing. I do much testing of search engines and can give a fairly good impression of what the rules are as I write. It is easiest to test the relevance rules of search engines that respond rapidly to a site submission. For this reason, Altavista has the best-understood relevance rules of any search engine. Here are some known relevance rules:

Altavista-gives greatest relevance to titles as defined by <TITLE> tags. After that it gives priority to headers, starting with level 1 headers (<H1>) and working down through the lower header levels. Last of all it weighs the body text and META tags, with the greatest weight given to text near the top of the page (and to any META tags). This hierarchy defines the skeleton of the result list returned by Altavista (a sketch of a page laid out with these rules in mind follows this list). Within each of these levels, timeliness (when a site was submitted) determines the order of the results. In the past the latest submissions were returned first, but that resulted in much promotional material cluttering the database, so in December 1996 Altavista inverted the relevance of timeliness. This had lots of ramifications. For me it means that most of the hits to my sites from Altavista now come from people with fetishistic depravities. Descriptions in description META tags are returned as summaries.

Webcrawler-a primitive search engine that appears to base relevance simply on the number of occurrences of a string in a document, without regard to placement in the document. This means large pages are consistently at the top of results returned from Webcrawler. Webcrawler ignores META tags, and timeliness is not a consideration. Machine-calculated samples of text are returned as summaries.

Infoseek-This search engine appears to give extra weight to META tags, but not to titles or other hierarchies of text. Keywords appearing near the beginning of a file have high relevance values. This works to the advantage of straight text files, since HTML files must begin with <HTML> and other strings that do not relate to the content. Simple repetition of keywords in the text or META tags seems to work miracles in getting sites to the top of the relevance rankings. Timeliness is a weak factor at most. Descriptions in description META tags are returned as summaries.

Magellan-gives greatest relevance to the titles as defined by <TITLE> tags. Then it gives priority to keywords, which must be entered when submitting a site to Magellan. Finally it takes into account the number of instances of words in the text. Timeliness is not a consideration. The first few dozen words of text are returned as summaries. Magellan is more upfront about its relevance rules than most and has included some information in its FAQ.

Hotbot-This search engine appears to give some weight to META tags, but not to titles or other hierarchies of text. Timeliness may be a factor, but it is not a big one. The ratio of a searched-for word to the other words in the content is decisive in determining relevance. Descriptions in description META tags are returned as summaries.

Lycos-appears to give relevance to the number of occurrences of a string in a document within the first 200 words of the document. Lycos appears to ignore META tags. Somewhat more relevance is given to the latest submissions. The first few dozen words of text are returned as summaries.

Excite-appears to give relevance to the number of occurrences of a string in a document, with most relevance given to content near the top of a page. Timeliness may or may not be an issue. Excite's robot spider is especially active at revisiting sites that change. Excite appears to ignore META tags. Machine-calculated samples of text are returned as summaries.

Opentext-appears to give relevance to the number of occurrences of a string in a document with most relevance to contents near the top of a page. The first few dozen words of text are returned as summaries.

Yahoo-Unlike most modern web databases, Yahoo is assembled by real humans. Relevance is returned in a hierarchical way, with rank determined by the whim of the individuals compiling the index. Brief human-written descriptions, or nothing at all, are returned as summaries.
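To pull the Altavista rules together, here is a sketch of a page laid out with a search for "dinosaur" in mind. The markup and wording are purely illustrative, and the weighting noted in the comments is just my reading of the behaviour described above:

  <HTML>
  <HEAD>
  <TITLE>Dinosaur Fossils of the Badlands</TITLE>
  <!-- the TITLE carries the greatest weight -->
  <META name="description" content="Field notes on dinosaur fossils found in the badlands.">
  </HEAD>
  <BODY>
  <H1>Dinosaur Digs</H1>
  <!-- level 1 headers come next, then the lower header levels -->
  <H2>Where the Dinosaur Fossils Are</H2>
  A mention of the dinosaur early in the body text counts for more than the
  same word buried far down the page, and the META tags sit at the top with it.
  </BODY>
  </HTML>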

the frequency of database updates and the activity level of robots

Altavista-the spider (Scooter) lies dormant for long periods but the database is updated nightly with the manual submissions made that day. There is a limit on the number of submissions per domain per day.

Webcrawler-both the spider and updating process seem to have been completely dormant since the summer of 1996. For this reason URLs in the index are few in number and out of date.

Infoseek-there doesn't appear to be a spider associated with Infoseek, but it updates its database regularly, every two weeks or less.

Magellan-unlike most of the other search engines, Magellan does not build its database with the help of robots. Like Infoseek, it relies wholly on submissions by web authors. Magellan does not appear to update its index very often.

Hotbot-the spiders are not particularly active for Hotbot, but Hotbot updates its database regularly.

Lycos-Lycos spiders sites that are submitted, but does not follow links off the server of a submitted site. The database is updated every few weeks or so.

Excite-has the most active spider of all, and so its database is always very current. Submissions to the Excite database are also accounted for relatively rapidly, in a week or two.

Opentext-both the spider and database updates seem to be dormant or chaotic at best.

Yahoo-these days, getting into this database requires being part of the Web good-old-boy network: being mentioned in magazines, or in the hot web sites by others regarded as cool. It's also possible to get your URL mentioned in Yahoo when new levels in the hierarchy are created. You see, Yahoo's human surfers consult other search engines (particularly Altavista) to find URLs to place on the new pages.

page submission guidelines

Altavista-pages submitted over the preceding 24 hours at Altavista's Add URL Page are added to the index, typically around midnight Eastern Standard Time every day. Altavista is not easily fooled by repeated instances of a keyword, either in a META tag or in the body of a web page. But if a term is sprinkled throughout a section of text or throughout a META tag, the page will be highly relevant in a search for that term (a sketch of the difference appears at the end of this page). There is a daily limit on the number of pages that can be submitted from a particular machine within a domain.

Infoseek-now provides a form for the submission of individual web pages and it updates its index almost instantly. However, it will not accept pages into its index if it determines (by artificial intelligence) that the page is designed as promotional material either for other pages or for extra-high relevance for particular keyword searches. Things that tip it off include:

Pages cannot be resubmitted until 24 hours has passed.
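Coming back to the Altavista note above about sprinkling versus repetition, here are two hypothetical keywords META tags for a page about kayaking. Judging by the behaviour described above, the first is the sort of blunt repetition Altavista is not fooled by, while the second sprinkles the term through varied phrases:

  <!-- blunt repetition: unlikely to help -->
  <META name="keywords" content="kayak kayak kayak kayak kayak">
  <!-- the term sprinkled through varied phrases -->
  <META name="keywords" content="kayak, kayaking, sea kayak, whitewater kayak, kayak touring, kayak rental">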
