Instructional Module W17c

Search Engine Operational Principles


Overview: Anatomy of a Search Engine

Web search engines are made of three parts:

  1. User Interface, or "front-end"
    This is the part we see. Its function is to take our request for information, get the information from the database, put it in order, and send it to us.
  2. Database, or "back-end"
    Here, vast amounts of information are processed and stored in a carefully organized form that makes retrieval quick.
  3. Crawlers, also known as "spiders" or "robots"
    These agents systematically move around the Web harvesting information.
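
To make "carefully organized" more concrete, here is a minimal sketch of one common way to organize such a database: an inverted index, which maps each word to the pages containing it. The module does not say which structure any particular engine uses, so treat this as an assumption; the function name `build_index` and the sample pages are made up for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


# Hypothetical sample data, not taken from any real site.
pages = {
    "http://example.com/a": "web crawlers harvest pages",
    "http://example.com/b": "search engines store pages in a database",
}

index = build_index(pages)

# Looking up a word is now a single dictionary access,
# no matter how many pages were stored.
print(sorted(index["pages"]))
```

The point of the design is that the slow work (reading every page) happens once, when the index is built, so each later lookup is quick.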

Figure 1 shows the relationship between the three parts.

Figure 1: Anatomy of a Search Engine

Read:
"How Search Engines Work", by Danny Sullivan of Search Engine Watch: http://searchenginewatch.com/webmasters/article.php/2168031


Crawlers
What They All Do

Crawlers are software programs that run on computers belonging to the search engine. (They don't actually move from one computer to another.) The job of the crawlers is to automatically harvest information. This is the general pattern of their actions:

  1. Start at a page determined by humans to be useful.
  2. Send the page back to the search engine to be processed.
  3. Follow the links on that page to other pages, sending each of those back and following its links in turn.

In this way, information from millions of Web pages is sent back to the search engines.
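
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: instead of fetching real pages over the network, it walks a made-up miniature "Web" stored in a dictionary (`TINY_WEB`), and the names `crawl` and `fetch_links` are hypothetical. A real crawler would also download and parse each page.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to the links found on that page.
TINY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(start_url, fetch_links):
    """Visit every page reachable from start_url, each one only once."""
    to_visit = deque([start_url])      # step 1: start at a chosen page
    seen = {start_url}
    harvested = []
    while to_visit:
        url = to_visit.popleft()
        harvested.append(url)          # step 2: send the page back for processing
        for link in fetch_links(url):  # step 3: follow links to other pages
            if link not in seen:       # skip pages already queued or visited
                seen.add(link)
                to_visit.append(link)
    return harvested


order = crawl("http://example.com/", lambda url: TINY_WEB.get(url, []))
print(order)
```

The `seen` set matters: pages link back to each other, and without it the crawler would go around in circles forever.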

Keeping Crawlers Out

Some Websites prefer not to appear in search engines. There are two ways to ask the Web crawlers not to enter your site:

  • At the root of your Website (public_html for many servers) you can put a file named robots.txt with information about which crawlers are forbidden to enter specific directories. If you're interested in more information, visit the Web Robots Pages, http://www.robotstxt.org/wc/exclusion.html
  • In the head of your Web files, include this meta-tag:
    <meta name="robots" content="noindex, nofollow">
    If you'd like the details, consult this page: http://www.robotstxt.org/wc/meta-user.html
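
To see how a well-behaved crawler might read a robots.txt file, here is a small sketch using Python's standard `urllib.robotparser` module. The file contents, the crawler names "BadBot" and "FriendlyBot", and the `/private/` directory are all invented for the example; the helper `allowed` is hypothetical as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: one crawler is banned from the whole site,
# and every other crawler is asked to stay out of one directory.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if robots.txt lets this crawler fetch this URL."""
    return parser.can_fetch(agent, url)


print(allowed("FriendlyBot", "http://example.com/index.html"))
print(allowed("FriendlyBot", "http://example.com/private/notes.html"))
print(allowed("BadBot", "http://example.com/index.html"))
```

Keep in mind that robots.txt is a request, not a lock: it only works for crawlers that choose to check it.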

Differences

Search engine crawlers are not all programmed the same way. Their designers try to make them better, or more specialized, by giving different answers to questions like these:

  • Where do they start?
  • What sort of information do they bring back from each page?
  • How far do they go when following links?
  • What sort of site do they visit - all, or only certain kinds?
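
Three of these questions (where to start, how far to go, and which sites to visit) can be pictured as settings handed to the crawler. The sketch below is an assumption about how such settings might look, not a description of any real crawler; `CrawlPolicy`, `should_visit`, and the example addresses are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    """Hypothetical settings that distinguish one crawler from another."""
    seeds: tuple             # where the crawler starts
    max_depth: int           # how many links away from a seed it will go
    allowed_suffixes: tuple  # which kinds of site it visits; empty means all


def should_visit(url, depth, policy):
    """Return True if this policy lets the crawler fetch url at this depth."""
    if depth > policy.max_depth:
        return False
    host = urlparse(url).hostname or ""
    if policy.allowed_suffixes and not host.endswith(policy.allowed_suffixes):
        return False
    return True


# A specialized crawler that only visits .edu sites, two links deep at most.
academic = CrawlPolicy(
    seeds=("http://example.edu/",),
    max_depth=2,
    allowed_suffixes=(".edu",),
)

print(should_visit("http://library.example.edu/catalog", 1, academic))
print(should_visit("http://shop.example.com/", 1, academic))
print(should_visit("http://library.example.edu/deep/page", 3, academic))
```

A general-purpose crawler would use the same check with an empty `allowed_suffixes` and a larger `max_depth`.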

Read:
"Web Robot FAQ" page at Robotstxt.org: http://www.robotstxt.org/wc/faq.html



User Interface
What They All Do

What is a hit?
A hit is an entry in the database that matches one or more of the search words entered by the user.

The User Interface (UI) is what we see when we use a search engine. It's more than a pretty page, of course! The UI is responsible for these jobs:

  • Welcome the user and make it easy to figure out how to use the system;
  • Accept user input and parse it into recognizable words and commands (such as quoted strings, "+", "-", AND, OR);
  • Find relevant hits in the database;
  • Compute the degree of relevance of each hit;
  • Sort the hits into order by relevance (and possibly slip a "sponsored link" in the right place);
  • Show the list of hits to the user.
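
The middle four jobs (parse the input, find hits, compute relevance, sort) can be sketched as follows. This is a toy under stated assumptions: relevance is simply the number of times the search terms occur in the page, which is far cruder than what real engines do; matching is by substring, so "cat" would also match "catalog"; and the names `parse_query`, `score`, `search`, and the sample pages are all hypothetical.

```python
import re

# An optional + or - sign, followed by a "quoted string" or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional


def score(text, terms):
    """Toy relevance: how many times the terms occur in the text."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in terms)


def search(pages, query):
    """Return the URLs of matching pages, most relevant first."""
    required, excluded, optional = parse_query(query)
    hits = []
    for url, text in pages.items():
        lowered = text.lower()
        if any(term in lowered for term in excluded):
            continue
        if not all(term in lowered for term in required):
            continue
        relevance = score(text, required + optional)
        if relevance > 0:  # a hit matches one or more search words
            hits.append((relevance, url))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [url for _, url in hits]


# Hypothetical sample pages.
pages = {
    "http://example.com/spiders": "Web crawler basics: how a web crawler follows links",
    "http://example.com/engines": "A search engine sends a crawler across the Web",
    "http://example.com/ads": "Buy crawler toys, cheap, cheap, cheap",
}

print(search(pages, '"web crawler" search -cheap'))
```

In this example the quoted string keeps "web crawler" together as one term, and the "-cheap" command removes the advertising page from the list entirely.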

Differences

Differences in the UI are the most obvious differences between search engines. They include:

  • Complexity of the starting page;
  • Type of commands recognized;
  • How relevance is computed;
  • Order in which hits are presented - there may be grouping into categories;
  • Which facts are presented about each Web page;
  • Visual style of the page and additional material presented there.

Read:
Search Engine Watch's Search Tips: http://www.searchenginewatch.com/facts/index.php

 
Who's Out There?
Search Engines

GOOGLE BULKS UP AS COMPETITION LOOMS

Google added an additional 1 billion pages to its Web index yesterday, increasing the number of pages it indexes from 3.3 billion to 4.28 billion. The search leader said it also had doubled the number of images in its index from 400 million to 880 million. Even those impressive numbers don't come close to covering the whole Web, however, which is estimated at somewhere around 10 billion pages. Meanwhile, rivals Yahoo and Microsoft are girding for battle. Yahoo plans to dump Google as its search engine and switch over to technology acquired through its purchase last year of Inktomi and Overture. At the same time, Microsoft is spending millions to develop its own proprietary search engine to use on MSN.com. According to comScore Media Metrix, Google's Web sites handled 35% of all Web searches in December, while Yahoo claimed 27% and Microsoft 15%. AOL and other Web sites owned by Time Warner made up 16% of the market. (AP/Los Angeles Times 18 Feb 2004) Quoted in NewsScan Daily, 18 February 2004

There are a few very well-known search engines and many special-purpose ones. In addition, there are indexes or directories of Web content. To keep track of what's out there now, the best source of information is Danny Sullivan's Search Engine Watch.

Read:
Search Engine Watch's Reviews of Search Engines, http://searchenginewatch.com/resources/article.php/2156581

Meta Engines

With all the differences between search engines, it's often a good idea to ask more than one. This is where the meta-search engines come in. They submit your search to multiple search engines, and organize the results for you.
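
Here is a rough sketch of the "submit and organize" idea. The two engines are stand-in functions that return fixed, made-up lists, since the module does not describe how any real meta engine talks to the engines it queries. The merging rule (add up 1/rank for each engine that lists a page) is just one simple choice, assumed here for illustration; `meta_search`, `engine_a`, and `engine_b` are hypothetical names.

```python
from collections import defaultdict


# Hypothetical stand-ins for real search engines:
# each returns a list of URLs, best hit first.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2", "http://example.com/3"]


def engine_b(query):
    return ["http://example.com/2", "http://example.com/4", "http://example.com/1"]


def meta_search(query, engines):
    """Ask every engine, then merge the lists into one without duplicates."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank  # 1 for first place, 1/2 for second, ...
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = meta_search("web crawlers", [engine_a, engine_b])
print(merged)
```

A page that several engines rank highly ends up near the top of the merged list, while each page still appears only once.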

There are two kinds of meta engines:

  • on-line: they look just like a regular search engine and are used through a Web browser;
  • desktop: these are tools that are downloaded and installed on your computer; there are two kinds:
    • toolbars: programs that install themselves inside another program - usually a browser;
    • stand-alone: programs that you run as separate applications.

Read:
Search Engine Watch's Reviews:
Online Meta-search: http://www.searchenginewatch.com/links/article.php/2156241
Downloadable Metas and Utilities: http://searchenginewatch.com/links/article.php/2156381


About This Document
Review Button

Click here for review questions.

Audience

This module is for people who are familiar with using search engines and want to know more about how they work.

 

Objectives

On successful completion of this module, you will be able to:

  1. List the three main components of any search engine, and explain the purpose of each;
  2. Explain how a search engine finds information;
  3. Discuss the factors that distinguish search engines in how they find information;
  4. List the types of information that may be stored in search engine databases;
  5. Define "hit" in the context of search engine use;
  6. Discuss differences between the ways search engines handle search terms, including normal search, advanced search, and the use of Boolean operators;
  7. Explain methods for ranking hits;
  8. Discuss the impact of commercial interests on search engines;
  9. Discuss the purpose of meta-search engines;
  10. Explain the types of meta-search engines;
  11. List at least two of the major on-line and downloadable meta-search engines;
  12. List at least two of the major general-purpose Web search engines;
  13. Describe the functions of a search engine’s “noise word” list;
  14. Describe how to focus a search engine’s hit list on desired words and phrases;
  15. Describe the “advanced search” functions of search engines;
  16. Describe the ranking algorithms of major search engines;
  17. List at least two sources of search engine ratings and technical comparisons;
  18. Discuss ways in which search engines can be used to provide good exposure to a Website.

Module W17c: Search Engine Operational Principles
This document is part of a modular instruction series in Computer Instruction. For more information, see the overview or the list of modules in this series, W: World Wide Web. This document has been used in the following classes: INP 160.
History:
Original: 16 October 2003, by Laurence J. Krieg
Last modification: Thursday, 18-Nov-2004 21:45:19 EST
Copyright:
Copyright © 2003-2004, Laurence J. Krieg, Washtenaw Community College
Instructors: You may point to this file in your Web-based materials; however, its location may change without notice.
Students: You are welcome to make a copy for your personal use.
All other uses: Please contact the author, Laurence J. Krieg, for permission: krieg@ieee.org.