Understanding Search Engines
For most people, the term "search engine" is synonymous with Google, one of the most powerful and popular web search tools in use. Any query typed into the Google search bar returns hundreds of matching web pages. A lesser-known fact is that the technology behind Google belongs to just one of several categories of search technique, each designed to explore the web quickly.
In a traditional search, entering a query in the search box is followed by:
- Search through the search engine database
- Identification of relevant web pages
- Display of the Search Engine Results Page (SERP)
All search engines strive to deliver relevant pages from the World Wide Web, but the manner in which listings are generated differs based on the kind of search engine and the algorithms used. The main types of search engines and how they work are:
- Crawler-based Search Engine: These search engines generally have three primary components:
- The Crawler or Spider: Spiders are software agents or robots deployed to travel through the web and generate a list of words and phrases together with where they occur (the URL), a process called crawling. The spiders begin at popular pages or heavily used servers and follow every link available on the site. In this way, the spiders work their way through the web and populate the database of the search engine. They return to these sites at regular intervals to look for updates. Because the Web changes constantly, it is crawled continuously to keep the engine's data current.
- The Indexer: All information retrieved by the spiders (the list of words and phrases along with their URLs) is encoded and organized into a structure called the index. The data structure generally used for this is an inverted index, often built on a hash table (hash map). An inverted index is efficient for keyword-based queries and makes information retrieval convenient, much like the index found at the end of most textbooks. The indexer thus stores each word together with the locations where it occurs and an assigned weight (based, say, on frequency of occurrence) in an organized structure ready for retrieval.
- The Query Processor: This final component accepts the search query and probes through millions of entries in the index to find relevant matches. Search engines employ different computational techniques to determine the relevance of the matching pages, which are then ordered by ranking algorithms and finally rendered to the user. The ranking depends on query-dependent factors (such as word frequency, language of the document and geographical location) and query-independent factors (such as the popularity and quality of the document). The final rendered SERP consists of both algorithmically ranked results and paid results.
Bing, Yahoo, Baidu, Yandex, DuckDuckGo, AOL and Ask all fall under this category of search engines.
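The three components described above can be sketched in a few lines of Python. This is only an illustrative toy, not how any real engine is implemented: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (no network access), the function names `crawl`, `build_index` and `search` are invented for this example, and the ranking is assumed to be a plain sum of word counts.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("search engines crawl the web", ["b.example", "c.example"]),
    "b.example": ("spiders crawl links and index words", ["c.example"]),
    "c.example": ("an index maps words to pages so the index answers queries",
                  ["a.example"]),
    "d.example": ("this page is not linked from anywhere", []),
}

def crawl(seed):
    """Crawler: breadth-first walk from a seed URL, visiting each link once."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(fetched):
    """Indexer: inverted index mapping word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in fetched.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Query processor: URLs containing every query word, best score first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    matches = set(postings[0]).intersection(*postings[1:])

    def score(url):
        # Assumed weighting: total occurrences of the query words on the page.
        return sum(p[url] for p in postings)

    return sorted(matches, key=lambda url: (-score(url), url))

index = build_index(crawl("a.example"))
print(search(index, "crawl"))        # ['a.example', 'b.example']
print(search(index, "index words"))  # ['c.example', 'b.example']
```

Note that `d.example` never enters the index because no crawled page links to it, which mirrors how a spider only discovers pages reachable through links.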
- Human-powered Directories: Next are directory-based engines, in which web links are organized into catalogs or subject directories, much like the table of contents at the front of a textbook. Instead of automation, these engines rely on people to do the categorization. The search takes place within this directory, which is built from websites and short descriptions. In most cases, a human editor looks through existing websites, reviews them and adds them to the directory along with a description. The pages are sorted by topic into a hierarchical structure, with similar pages grouped under the same topic and ranked by relevance. A user's query returns a list of the best-matching descriptions from this directory. Along with the directory results, the final listing also includes paid results, which are ranked as well. Because a dedicated team of people, rather than a complex algorithm, decides which pages are listed, irrelevant results are less likely. The emphasis here is on relevance: a query returns a limited number of web pages, whereas crawler-based engines return thousands of pages for a given query.
Open Directory, LookSmart, ChaCha, Mahalo and even Yahoo at one point belonged to this group of search engines.
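A directory of this kind can be pictured as a topic hierarchy holding hand-written entries. The sketch below is a minimal illustration under stated assumptions: the `DIRECTORY` contents, the site names and the functions `add_entry` and `search_directory` are all hypothetical, and matching is assumed to be a simple count of query words found in the description or topic names.

```python
# Hypothetical hand-curated directory: topic path -> list of (site, description).
DIRECTORY = {
    ("Science", "Astronomy"): [
        ("stars.example", "Beginner guide to observing stars and planets"),
    ],
    ("Science", "Biology"): [
        ("cells.example", "Introduction to cell biology"),
    ],
    ("Recreation", "Travel"): [
        ("trips.example", "Travel guide with city itineraries"),
    ],
}

def add_entry(path, site, description):
    """A human editor files a reviewed site under a topic path."""
    DIRECTORY.setdefault(tuple(path), []).append((site, description))

def search_directory(query):
    """Match query words against descriptions and topic names only."""
    words = set(query.lower().split())
    hits = []
    for path, entries in DIRECTORY.items():
        for site, description in entries:
            text = set((description + " " + " ".join(path)).lower().split())
            overlap = len(words & text)
            if overlap:
                hits.append((overlap, "/".join(path), site))
    # Most matching words first; ties broken alphabetically by site.
    hits.sort(key=lambda hit: (-hit[0], hit[2]))
    return [(path, site) for _, path, site in hits]

print(search_directory("travel guide"))
```

Only the short descriptions and topic names are searched, never the full page text, which is why such a directory returns a small list of results.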
- Hybrid Search Engine: The two techniques described above work in opposite ways, and each has its own benefits. A crawler-based search engine works well for specific queries but is less effective at providing relevant results for general queries. A human-powered directory, by contrast, provides better results for general queries but cannot offer the same efficiency for specific ones. A hybrid search engine, as the name suggests, therefore combines crawler-based results with directory results.
Yahoo, MSN, and Google employ this technique to render their search results.
- Meta Search Engine: These take the results of several other search engines and combine them into a single, larger listing. By gathering results simultaneously from the indexes of third-party search engines, they cover a wide range of pages. The results are then processed, ranked and presented to the user. However, once duplicates are removed, the number of results for a given query is often small and may not fully meet the user's needs.
Dogpile, Metaseek, and SavvySearch are a few examples of such meta search engines.
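The merge step of a meta search engine can be sketched as follows. The three `engine_*` functions are hypothetical stand-ins for third-party engines, each returning a fixed ranked list. The scoring rule, which adds 1/rank for every list a URL appears in, is one simple merging choice assumed for this example, not a rule the engines named above are known to use.

```python
from collections import defaultdict

# Hypothetical stand-ins for third-party engines: each returns a ranked URL list.
def engine_a(query):
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    return ["y.example", "x.example", "w.example"]

def engine_c(query):
    return ["y.example", "z.example"]

def meta_search(query, engines):
    """Merge ranked lists, removing duplicates and re-ranking by summed 1/rank."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))

merged = meta_search("any query", [engine_a, engine_b, engine_c])
print(merged)  # ['y.example', 'x.example', 'z.example', 'w.example']
```

The three engines return eight results between them, but only four distinct URLs remain after merging, which illustrates how redundancy removal shrinks the final listing.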
Besides the search engines described above, various other search techniques are competing for user attention, such as WolframAlpha, a computational search engine, and Swoogle, a semantic search engine. Given the massive extent of the World Wide Web, search engines are constantly evolving to deliver instant, hassle-free and relevant responses to our incessant queries.