Understanding Elasticsearch

I’ve always wondered how search in large applications can return results quickly, tolerate slight misspellings, and still rank results by relevance. It turns out that many modern apps do this by using Elasticsearch. Elasticsearch is an open source analytics and full-text search engine. Its speed, efficiency, and ability to return the most relevant data make it the search engine powering some of the world’s most popular apps, such as:

  • Uber

  • Github

  • Shopify

  • Yelp

  • Grubhub

  • Instacart

I’ve always been interested in understanding how Elasticsearch provides such great results compared to a standard database query, so I decided to get a better understanding of its underlying architecture.

Overview

Using Elasticsearch, you can take data from a source, in any format, then search, analyze, and visualize it. As your app’s data grows, Elasticsearch is able to grow right alongside it as it’s highly scalable.

The most common use cases for Elasticsearch are:

  • logging

  • metrics

  • analytics

To better understand the role that Elasticsearch plays as a search engine, let’s take a look at a standard request / response cycle without it:

Here a client makes a request to the server. The server then hits the database and returns a response to the client. Because the database stores data across multiple tables, querying and joining them introduces latency, a lag in returning data. Databases also aren’t the best fit for full-text search.

With Elasticsearch, when the client makes a request to the server, instead of hitting the database, the server makes a RESTful API request to Elasticsearch, which returns a response that the server then passes back to the client (we’ll take a look at what makes this transaction so fast in a bit):
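As a sketch of that flow, the server’s call to Elasticsearch is just an HTTP request to its REST API. The index name `products` and the field `name` below are hypothetical, stand-ins for whatever your app stores:

```
GET /products/_search
{
  "query": {
    "match": { "name": "wireless headphones" }
  }
}
```

The `match` query handles the analyzed, full-text matching that makes Elasticsearch tolerant of things like misspellings and word order.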

Clusters & Nodes

A node is an instance of Elasticsearch. When a node is created, a cluster is automatically formed. Each cluster can have one to many nodes. Each node has a unique id and name and belongs to a single cluster:
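You can ask any node about the cluster it belongs to through the REST API. The response below is abridged and the cluster name is made up, but `status` and `number_of_nodes` are real fields Elasticsearch returns:

```
GET /_cluster/health

{
  "cluster_name": "my-cluster",
  "status": "green",
  "number_of_nodes": 3
}
```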

Documents & Indices

It’s important to understand that nodes can have one or multiple roles and can hold data. Data in a node is stored as a document in JSON format. We can think of each document as being similar to a record in MongoDB or a row in a relational database, like MySQL or PostgreSQL. Each document has its own unique id.
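For instance, a document in a hypothetical books index might look like this (the field values are made up; `_index` and `_id` are metadata Elasticsearch keeps alongside every document’s `_source`):

```json
{
  "_index": "books",
  "_id": "1",
  "_source": {
    "title": "The Pragmatic Programmer",
    "year": 1999,
    "genres": ["programming", "software engineering"]
  }
}
```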

Documents that share a common category are grouped together in what is called an index. For example, if we were building an online store, we might have indices for books, movies, and video games:

When searching, we specify the index as search queries are run against indices.
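Since queries run against indices, the index name is part of the request URL, and you can target several indices at once. A sketch, using the hypothetical store indices above (the `//` comments are just annotations):

```
GET /books/_search         // search only books
GET /books,movies/_search  // search books and movies together
```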

Shards

An index is not actually responsible for storing documents; it keeps track of where documents are stored. Shards are where the data actually lives.

When an index is created, it comes with one shard by default. You can configure your cluster to have multiple shards distributed across nodes, a practice known as sharding. This is what makes Elasticsearch highly scalable: you can scale horizontally by adding more nodes and shards as your data grows.
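Shard counts are set when an index is created. A sketch of creating the hypothetical books index with three primary shards (note that `number_of_shards` can’t be changed on an existing index without reindexing or splitting):

```
PUT /books
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```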

The number of documents a shard can hold depends on the capacity of its node. For example, if we had 800k documents to index, but each of our nodes could only hold 200k, we could create more nodes and distribute shards across them:
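How does Elasticsearch decide which shard a given document lands on? It hashes the document’s routing value (the document id by default) and takes it modulo the number of primary shards. The sketch below uses Python’s `zlib.crc32` as a stand-in for the murmur3 hash Elasticsearch actually uses; the `doc-N` ids are made up:

```python
# Simplified sketch of Elasticsearch's shard-routing rule:
#   shard = hash(routing_value) % number_of_primary_shards
# Real Elasticsearch hashes the routing value (the document id
# by default) with murmur3; zlib.crc32 stands in for it here.
import zlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick the primary shard that will hold the document."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# With 4 primary shards, 800k documents spread roughly evenly:
counts = [0, 0, 0, 0]
for i in range(800_000):
    counts[route_to_shard(f"doc-{i}", 4)] += 1

print(counts)  # roughly 200k documents per shard
```

Because the formula depends on the shard count, changing the number of primary shards would re-route every document, which is why that setting is fixed at index creation.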

Speed & Resiliency

One of the main questions I had about Elasticsearch is how it handles querying so quickly. Let’s say we had one million documents. We have a couple of options on how to store these documents within Elasticsearch:

  • one million in a single shard, on a single node OR

  • 100k distributed amongst 10 shards, over 10 nodes:

When Elasticsearch runs a query against our single shard, it runs sequentially. Let’s say it takes 10 seconds in our example of a single shard on a single node. We might assume the same query would take just as long on shards distributed across nodes, like in our example of 10 nodes, each with its own shard holding 100k documents. However, when we distribute our data, Elasticsearch can run the same query in about 1 second, because it runs the query in parallel across all the distributed shards, making it MUCH faster. This is great: as your data grows, Elasticsearch can continue to search at scale.
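The arithmetic above can be sketched as a toy latency model. The per-document scan cost is a hypothetical number chosen so the single-shard case works out to 10 seconds; real query times depend on many more factors:

```python
# Toy latency model: one big shard scans documents sequentially,
# while shards distributed across nodes scan in parallel, so the
# total time is roughly the time of one (smaller) shard.
PER_DOC_S = 1e-5  # hypothetical scan cost per document, in seconds

def sequential_time(docs: int, per_doc_s: float = PER_DOC_S) -> float:
    """One shard scans every document, one after another."""
    return docs * per_doc_s

def parallel_time(docs: int, shards: int,
                  per_doc_s: float = PER_DOC_S) -> float:
    """Each shard scans docs/shards documents at the same time."""
    return (docs / shards) * per_doc_s

print(sequential_time(1_000_000))    # ~10 seconds on a single shard
print(parallel_time(1_000_000, 10))  # ~1 second across 10 shards
```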

Resiliency & Data Loss

As is bound to happen when dealing with technology, things fail. What happens when one of our nodes goes down? We run the risk of losing our data. To solve this, Elasticsearch allows us to create replicas (copies) of our original shards. It stores these replicas on different nodes from their originals, preventing data loss:

Replicas also provide the added benefit of improving search. Let’s say your app receives an influx of traffic and as a result, a spike in the number of queries that Elasticsearch needs to handle. By distributing replicas across nodes, Elasticsearch can manage this increase in demand by reading from those distributed replicas.
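Unlike the number of primary shards, the number of replicas can be changed on a live index. A sketch, again using the hypothetical books index:

```
PUT /books/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
```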

Conclusion

I hope this overview paints a general picture of how Elasticsearch works under the hood. There’s a number of other layers that comprise what’s known as the Elastic Stack. I’m looking forward to diving in and learning more about it in the near future.
