
Elasticsearch Search Engine | An introduction

Elasticsearch is a full-text search and analytics engine built on Apache Lucene. It makes it easier to aggregate data from multiple sources and to run loosely structured queries, such as fuzzy searches, on the stored data. It stores data as documents, similar to how MongoDB does, and each document is serialized as JSON. Because of this non-relational model, it can also be used as a NoSQL database. A typical Elasticsearch document looks like this:

{
  "first_name": "Divij",
  "last_name": "Sehgal",
  "email": "xyz@gmail.com",
  "dob": "04-11-1995",
  "city": "Mumbai",
  "state": "Maharashtra",
  "country": "India",
  "occupation": "Software Engineer"
}

Some of its key characteristics:

  • It is distributed and horizontally scalable: more Elasticsearch instances can be added to a cluster as the need arises, instead of increasing the capacity of a single machine running one instance.
  • It is RESTful and API-centric. Its operations are accessible over HTTP through the REST API, so it can be integrated into almost any application. In addition, client libraries are available for many programming languages. They handle communication with the engine, so most operations can be performed through library function calls instead of manual API requests.
  • It supports CRUD operations (Create, Read, Update, Delete) on the data in persistent storage. These are similar to the CRUD operations of relational databases and are performed through the HTTP interface of the REST API.
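
As a rough illustration of the CRUD operations over HTTP, the sketch below prepares (but does not send) one request per operation using only the Python standard library. The host `localhost:9200`, the index name `people`, and the document ID `1` are assumptions chosen for the example, and the paths follow the typeless `_doc` and `_update` endpoints of recent Elasticsearch versions; older versions may use different paths.

```python
import json
from urllib.request import Request

# Assumption: a local single-node cluster on the default HTTP port.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Prepare (but do not send) an HTTP request for the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical index name and document ID, chosen for illustration.
INDEX, DOC_ID = "people", "1"
person = {"first_name": "Divij", "last_name": "Sehgal", "city": "Mumbai"}

crud_requests = {
    # Create: store a document under an explicit ID (replaces it if present).
    "create": build_request("PUT", f"/{INDEX}/_doc/{DOC_ID}", person),
    # Read: fetch the document by ID.
    "read": build_request("GET", f"/{INDEX}/_doc/{DOC_ID}"),
    # Update: merge the given fields into the existing document.
    "update": build_request(
        "POST", f"/{INDEX}/_update/{DOC_ID}", {"doc": {"city": "Pune"}}
    ),
    # Delete: remove the document by ID.
    "delete": build_request("DELETE", f"/{INDEX}/_doc/{DOC_ID}"),
}

for name, req in crud_requests.items():
    print(f"{name:<8}{req.get_method():<8}{req.full_url}")
```

Against a running cluster, each prepared request could be sent with `urllib.request.urlopen(req)`. In practice a client library would usually wrap these calls.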

Where do we use Elasticsearch?

Elasticsearch is a good fit for –

  • Storing and operating on unstructured or semi-structured data whose structure may change often. Because an index does not need a fixed schema up front, adding a new field does not carry the overhead of altering a table, as it would in a relational database. When incoming documents contain a new field, Elasticsearch accommodates it in the index and makes it available to further operations.
  • Full-text searches: Elasticsearch scores each document for relevance by correlating the search terms with the document content, using TF-IDF-style statistics (recent versions use the BM25 algorithm by default). Results, including those of fuzzy searches, are returned ranked by that relevance score.
  • Elasticsearch is commonly used as a storage and analysis tool for logs generated by disparate systems. Visualization tools such as Kibana can be used to build aggregations and visualizations in real time from the collected data.
  • It works well for time-series analysis of data, as it can extract metrics from the incoming data in near real time.
  • Infrastructure monitoring in CI/CD pipelines.
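
To make the idea of relevance ranking more concrete, here is a minimal sketch of textbook TF-IDF scoring in plain Python. It is a simplified illustration only, not the scoring formula Lucene actually applies, and the three sample log lines and the query are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(query, documents):
    """Score every document against the query with a plain TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue  # the term contributes nothing to this document
            tf = counts[term] / len(tokens)           # term frequency
            idf = math.log(n_docs / doc_freq[term])   # inverse document frequency
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical sample documents, invented for this example.
docs = [
    "error connecting to database",
    "database backup completed",
    "user login failed with error",
]
scores = tf_idf_scores("database error", docs)

# Document positions ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The first document matches both query terms and therefore ranks highest. The other two match one term each, and the shorter one scores higher because the matching term makes up a larger share of its text.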

Elasticsearch Concepts

Elasticsearch works on a concept known as inverted indexing. This concept comes from the Lucene library (the same Apache Lucene mentioned above). An inverted index is similar to the index at the back of a book, which lists the pages on which each important term is discussed. It makes it easier to resolve a query to the specific documents it could relate to, based on the keywords in the query, and it speeds up document retrieval by limiting the set of documents considered for that query. Let’s take the following four Game of Thrones dialogues:

  1. “Winter is coming.”
  2. “A mind needs books as a sword needs a whetstone, if it is to keep its edge.”
  3. “Every flight begins with a fall.”
  4. “Words can accomplish what swords cannot.”

Consider each of these dialogues as a single document, i.e., each document has a structure like:

{
    "dialogue": "....."
}

After some simple text processing (lowercasing the text and removing punctuation), we can construct the “inverted index” as follows:

Term         Frequency   Documents
a            4           2, 3
accomplish   1           4
as           1           2
begins       1           3
books        1           2
can          1           4
cannot       1           4
coming       1           1
edge         1           2
every        1           3
fall         1           3
flight       1           3
if           1           2
is           2           1, 2
it           1           2
its          1           2
keep         1           2
mind         1           2
needs        2           2
sword        1           2
swords       1           4
to           1           2
what         1           4
whetstone    1           2
winter       1           1
with         1           3
words        1           4
  • The first two columns form what is called the Dictionary. This is where Elasticsearch looks up the search terms to find out which documents could be relevant to the current search.
  • The third column is referred to as the Postings. It links each term to the documents in which it appears.
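
The construction described above can be sketched in a few lines of Python. This is an illustrative toy, not how Lucene stores its index: the function names `build_inverted_index` and `candidates` are hypothetical, and the text processing is limited to lowercasing and stripping punctuation, with no stemming.

```python
import re
from collections import defaultdict

dialogues = [
    "Winter is coming.",
    "A mind needs books as a sword needs a whetstone, if it is to keep its edge.",
    "Every flight begins with a fall.",
    "Words can accomplish what swords cannot.",
]


def tokenize(text):
    """Lowercase the text and drop punctuation, keeping alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to (total frequency, sorted IDs of documents containing it)."""
    frequency = defaultdict(int)
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):  # document IDs start at 1
        for term in tokenize(text):
            frequency[term] += 1
            postings[term].add(doc_id)
    return {
        term: (frequency[term], sorted(postings[term]))
        for term in sorted(frequency)
    }


def candidates(index, query):
    """Return the IDs of documents containing at least one query term."""
    hits = set()
    for term in tokenize(query):
        _, doc_ids = index.get(term, (0, []))
        hits.update(doc_ids)
    return sorted(hits)


index = build_inverted_index(dialogues)
print(index["a"])                       # (4, [2, 3])
print(index["needs"])                   # (2, [2])
print(candidates(index, "sword fall"))  # [2, 3]
```

Note that a query for "sword" does not match document 4, because "swords" is stored as a separate term. A real analyzer can reduce such variants to a common form, which this sketch leaves out.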

A few common terms associated with Elasticsearch are as follows:

  • Cluster: A cluster is a group of systems running the Elasticsearch engine that cooperate closely to store data and resolve queries. These systems are further classified based on their role in the cluster.
  • Node: A node is a JVM process running an instance of Elasticsearch, independently accessible over a network by other machines or nodes in the cluster.
  • Index: An index in Elasticsearch is analogous to a table in a relational database.
  • Mapping: Each index has a mapping associated with it, which is essentially a schema definition for the data that each document in the index can hold. It can be created manually for each index, or it can be added automatically when data is pushed to the index.
  • Document: A JSON document. In relational terms, this would represent a single row in a table.
  • Shard: A shard is a block of an index's data. The shards held by a given node may or may not belong to the same index. Since the data in a single index can grow very large, say a few hundred GBs or even a few TBs, it is often infeasible to grow storage vertically. Instead, the data is logically divided into shards stored on different nodes, each of which operates on the data it holds. This allows for horizontal scaling.
  • Replicas: Each shard in a cluster may be replicated to one or more other nodes in the cluster. This provides a failover backup: if one of the nodes goes down or cannot use its resources at the moment, a replica holding the same data can serve requests in its place. By default, one replica is created for each shard, and the number is configurable. In addition to failover, replicas also increase search performance.
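
The terms above can be tied together with a small sketch. The first part is a hypothetical request body for creating an index, with explicit shard and replica counts and a mapping for the sample document shown earlier. The field types and the `dd-MM-yyyy` date format are assumptions made for illustration. The second part is a deliberately simplified picture of how a document could be routed to a shard: `pick_shard` is a hypothetical helper and `zlib.crc32` is a stand-in hash, not the function Elasticsearch uses internally.

```python
import json
import zlib

# Hypothetical body for an index-creation request (e.g. PUT /people).
create_index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards the index is split into
        "number_of_replicas": 1,  # replica copies kept for each primary shard
    },
    "mappings": {
        "properties": {
            "first_name": {"type": "text"},
            "last_name": {"type": "text"},
            "email": {"type": "keyword"},
            # Assumption: the sample value "04-11-1995" is day-month-year.
            "dob": {"type": "date", "format": "dd-MM-yyyy"},
            "city": {"type": "keyword"},
            "state": {"type": "keyword"},
            "country": {"type": "keyword"},
            "occupation": {"type": "text"},
        }
    },
}


def pick_shard(routing_value, num_primary_shards):
    """Simplified routing: hash the routing value modulo the shard count.

    zlib.crc32 is a stand-in hash used only for this illustration.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


settings = create_index_body["settings"]
# Every primary shard plus its replicas counts as one shard copy.
total_copies = settings["number_of_shards"] * (1 + settings["number_of_replicas"])

print(json.dumps(create_index_body["settings"]))
print("total shard copies:", total_copies)
for doc_id in ["1", "2", "3", "4"]:
    shard = pick_shard(doc_id, settings["number_of_shards"])
    print(f"document {doc_id} -> shard {shard}")
```

With three primary shards and one replica each, the cluster holds six shard copies in total. Because the routing is a pure function of the document's routing value (its ID, in this sketch) and the primary shard count, the same document always maps to the same shard.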

Last Updated : 10 Feb, 2023